!pip install -U turicreate
!pip install seaborn
!pip install tensorflow
!pip install -q keras
!pip install deepctr[gpu]
Collecting turicreate
Collecting numba<0.51.0
Collecting coremltools==3.3
Collecting tensorflow<2.1.0,>=2.0.0
Collecting resampy==0.2.1
Collecting prettytable==0.7.2
Collecting llvmlite<0.34,>=0.33.0.dev0
Collecting tensorboard<2.1.0,>=2.0.0
Collecting keras-applications>=1.0.8
Collecting tensorflow-estimator<2.1.0,>=2.0.0
Collecting gast==0.2.2
Building wheels for collected packages: resampy, prettytable, gast
Successfully built resampy prettytable gast
ERROR: tensorflow 2.0.4 has requirement numpy<1.19.0,>=1.16.0, but you'll have numpy 1.19.5 which is incompatible.
ERROR: tensorflow-probability 0.12.1 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.
ERROR: librosa 0.8.0 has requirement resampy>=0.2.2, but you'll have resampy 0.2.1 which is incompatible.
Successfully installed coremltools-3.3 gast-0.2.2 keras-applications-1.0.8 llvmlite-0.33.0 numba-0.50.1 prettytable-0.7.2 resampy-0.2.1 tensorboard-2.0.2 tensorflow-2.0.4 tensorflow-estimator-2.0.1 turicreate-6.4.1
Requirement already satisfied: seaborn in /usr/local/lib/python3.7/dist-packages (0.11.1)
Requirement already satisfied: tensorflow in /usr/local/lib/python3.7/dist-packages (2.0.4)
Collecting numpy<1.19.0,>=1.16.0
ERROR: tensorflow-probability 0.12.1 has requirement gast>=0.3.2, but you'll have gast 0.2.2 which is incompatible.
ERROR: librosa 0.8.0 has requirement resampy>=0.2.2, but you'll have resampy 0.2.1 which is incompatible.
ERROR: datascience 0.10.6 has requirement folium==0.2.1, but you'll have folium 0.8.3 which is incompatible.
ERROR: albumentations 0.1.12 has requirement imgaug<0.2.7,>=0.2.5, but you'll have imgaug 0.2.9 which is incompatible.
Installing collected packages: numpy
Successfully installed numpy-1.18.5
Collecting deepctr[gpu]
Collecting tensorflow-gpu!=1.7.*,!=1.8.*,>=1.4.0; extra == "gpu"
Collecting gast==0.3.3
Collecting tensorboard~=2.4
Collecting tensorflow-estimator<2.5.0,>=2.4.0
Collecting tensorboard-data-server<0.7.0,>=0.6.0
ERROR: tensorflow 2.0.4 has requirement gast==0.2.2, but you'll have gast 0.3.3 which is incompatible.
ERROR: tensorflow 2.0.4 has requirement tensorboard<2.1.0,>=2.0.0, but you'll have tensorboard 2.5.0 which is incompatible.
ERROR: tensorflow 2.0.4 has requirement tensorflow-estimator<2.1.0,>=2.0.0, but you'll have tensorflow-estimator 2.4.0 which is incompatible.
ERROR: tensorflow-gpu 2.4.1 has requirement numpy~=1.19.2, but you'll have numpy 1.18.5 which is incompatible.
Installing collected packages: gast, tensorboard-data-server, tensorboard, tensorflow-estimator, tensorflow-gpu, deepctr
Successfully installed deepctr-0.8.5 gast-0.3.3 tensorboard-2.5.0 tensorboard-data-server-0.6.1 tensorflow-estimator-2.4.0 tensorflow-gpu-2.4.1
import time
import numpy as np
import pandas as pd
import seaborn as sns
from collections import defaultdict
import turicreate as tc
from turicreate.toolkits.recommender.util import precision_recall_by_user
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import backend as K
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import plot_model
from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint, Callback
from tensorflow.keras.models import Model, load_model, save_model
from tensorflow.keras.regularizers import l2
from tensorflow.keras.layers import Embedding, Input, Dense, Flatten, Dropout, Multiply, Concatenate, Reshape, Lambda, Dot, Add, Activation, Subtract
from tensorflow.keras.optimizers import Adagrad, Adam, SGD, RMSprop, Adamax
from deepctr.models import DeepFM
from deepctr.feature_column import SparseFeat, DenseFeat, VarLenSparseFeat, get_feature_names
from os.path import join, isdir
from os import mkdir, getcwd
from matplotlib import pyplot as plt
import warnings
warnings.filterwarnings('ignore')
from urllib.request import urlretrieve
import zipfile
# Download MovieLens 100K dataset.
print("Downloading MovieLens data...")
DATASET_ZIP_URL = "http://files.grouplens.org/datasets/movielens/ml-100k.zip"
DATASET_OUTPUT_DIR = 'ml-100k'
urlretrieve(DATASET_ZIP_URL, "movielens.zip")
with zipfile.ZipFile('movielens.zip', 'r') as zip_ref:
    zip_ref.extractall()
print("Done.")
Downloading MovieLens data... Done.
def get_file_path(filename):
    return join(DATASET_OUTPUT_DIR, filename)
print("Dataset contains the following:")
with open(get_file_path('u.info'), 'r') as info_file:
    print(info_file.read())
Dataset contains the following:
943 users
1682 items
100000 ratings
def decrement_ids(col_name, df):
    '''Shift the 1-based movie/user ids in the given column to be 0-based.'''
    return df[col_name].apply(lambda x: int(x - 1))
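As a quick illustration of the shift on a toy frame (the values below are synthetic, not rows from the MovieLens files):

```python
import pandas as pd

def decrement_ids(col_name, df):
    '''Shift the 1-based ids in the given column to be 0-based.'''
    return df[col_name].apply(lambda x: int(x - 1))

# Toy frame with 1-based ids, as in the raw MovieLens files.
toy = pd.DataFrame({"user_id": [1, 2, 5]})
toy["user_id"] = decrement_ids("user_id", toy)
print(toy["user_id"].tolist())  # → [0, 1, 4]
```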
users_cols = ['user_id', 'age', 'sex', 'occupation', 'zip_code']
users = pd.read_csv(get_file_path('u.user'), sep='|', names=users_cols, encoding='latin-1')
# Since the ids start at 1, we shift them to start at 0.
users["user_id"] = decrement_ids('user_id', users)
print(f'User data shape: {users.shape}')
users.head(10)
User data shape: (943, 5)
| | user_id | age | sex | occupation | zip_code |
|---|---|---|---|---|---|
| 0 | 0 | 24 | M | technician | 85711 |
| 1 | 1 | 53 | F | other | 94043 |
| 2 | 2 | 23 | M | writer | 32067 |
| 3 | 3 | 24 | M | technician | 43537 |
| 4 | 4 | 33 | F | other | 15213 |
| 5 | 5 | 42 | M | executive | 98101 |
| 6 | 6 | 57 | M | administrator | 91344 |
| 7 | 7 | 36 | M | administrator | 05201 |
| 8 | 8 | 29 | M | student | 01002 |
| 9 | 9 | 53 | M | lawyer | 90703 |
# Load movies dataset
# The movies file contains a binary feature for each genre.
genre_cols = [
    "Unknown", "Action", "Adventure", "Animation", "Children", "Comedy",
    "Crime", "Documentary", "Drama", "Fantasy", "Film-Noir", "Horror",
    "Musical", "Mystery", "Romance", "Sci-Fi", "Thriller", "War", "Western"
]
movies_cols = ['movie_id', 'title', 'release_date', "video_release_date", "imdb_url"] + genre_cols
movies = pd.read_csv(get_file_path('u.item'), sep='|', names=movies_cols, encoding='latin-1')
# Since the ids start at 1, we shift them to start at 0.
movies["movie_id"] = decrement_ids('movie_id', movies)
movies[genre_cols] = movies[genre_cols].astype(int)
movies['release_date'] = movies['release_date'].fillna('')
movies["year"] = movies['release_date'].apply(lambda x: int(str(x).split('-')[-1]) if x else 0)  # Separate out the release year of each movie
movies['release_date'] = movies['release_date'].replace('', np.nan)
if movies['video_release_date'].isnull().all():
movies.drop('video_release_date', axis=1, inplace=True)
print(f'Movie data shape: {movies.shape}')
movies.head(10)
Movie data shape: (1682, 24)
| movie_id | title | release_date | imdb_url | Unknown | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western | year | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | Toy Story (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Toy%20Story%2... | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1995 |
| 1 | 1 | GoldenEye (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?GoldenEye%20(... | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1995 |
| 2 | 2 | Four Rooms (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Four%20Rooms%... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1995 |
| 3 | 3 | Get Shorty (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Get%20Shorty%... | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1995 |
| 4 | 4 | Copycat (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Copycat%20(1995) | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1995 |
| 5 | 5 | Shanghai Triad (Yao a yao yao dao waipo qiao) ... | 01-Jan-1995 | http://us.imdb.com/Title?Yao+a+yao+yao+dao+wai... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1995 |
| 6 | 6 | Twelve Monkeys (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Twelve%20Monk... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1995 |
| 7 | 7 | Babe (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Babe%20(1995) | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1995 |
| 8 | 8 | Dead Man Walking (1995) | 01-Jan-1995 | http://us.imdb.com/M/title-exact?Dead%20Man%20... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1995 |
| 9 | 9 | Richard III (1995) | 22-Jan-1996 | http://us.imdb.com/M/title-exact?Richard%20III... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1996 |
# Load ratings dataset
ratings_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings = pd.read_csv(get_file_path('u.data'), sep='\t', names=ratings_cols, encoding='latin-1')
# Since the ids start at 1, we shift them to start at 0.
ratings["movie_id"] = decrement_ids('movie_id', ratings)
ratings["user_id"] = decrement_ids('user_id', ratings)
rating_user_count = ratings["user_id"].nunique()
rated_movie_count = ratings["movie_id"].nunique()
print(f'Rating data shape: {ratings.shape}')
print(f'Rating user count: {rating_user_count}')
print(f'Rated movie count: {rated_movie_count}')
ratings.head(10)
Rating data shape: (100000, 4)
Rating user count: 943
Rated movie count: 1682
| user_id | movie_id | rating | unix_timestamp | |
|---|---|---|---|---|
| 0 | 195 | 241 | 3 | 881250949 |
| 1 | 185 | 301 | 3 | 891717742 |
| 2 | 21 | 376 | 1 | 878887116 |
| 3 | 243 | 50 | 2 | 880606923 |
| 4 | 165 | 345 | 1 | 886397596 |
| 5 | 297 | 473 | 4 | 884182806 |
| 6 | 114 | 264 | 2 | 881171488 |
| 7 | 252 | 464 | 5 | 891628467 |
| 8 | 304 | 450 | 3 | 886324817 |
| 9 | 5 | 85 | 3 | 883603013 |
sns.displot(x=ratings['rating'], bins=20, kde=True, color="b", height=5, aspect=2)
plt.title("Distribution of User Ratings")
plt.ylabel('Number of Ratings')
plt.xlabel('Rating (Out of 5)')
plt.show()
Observe the top 20 movies by mean rating
ratings.groupby(by='movie_id')['rating'].mean().sort_values(ascending=False).head(20)
movie_id
1292    5.000000
1466    5.000000
1652    5.000000
813     5.000000
1121    5.000000
1598    5.000000
1200    5.000000
1188    5.000000
1499    5.000000
1535    5.000000
1448    4.625000
1641    4.500000
118     4.500000
1397    4.500000
1593    4.500000
407     4.491071
317     4.466443
168     4.466102
482     4.456790
113     4.447761
Name: rating, dtype: float64
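The run of perfect 5.0 averages above is an artifact of movies with very few raters; aggregating the mean together with the rating count makes that visible. A toy sketch (the ids and ratings below are made up):

```python
import pandas as pd

toy = pd.DataFrame({'movie_id': [0, 0, 0, 1],
                    'rating':   [4, 5, 3, 5]})
# Mean and rating count side by side expose low-support averages
stats = toy.groupby('movie_id')['rating'].agg(['mean', 'count'])
# movie 1 has a perfect 5.0 average, but from a single rating
```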
def get_average_movie_ratings(ratings_df):
average_ratings = ratings_df.groupby('movie_id', as_index=False).agg(avg_rating=('rating', 'mean'), count=('rating', 'count')) # Average rating and rating count per movie
return pd.merge(average_ratings, movies, on='movie_id')[['movie_id', 'avg_rating', 'title', 'count']]
def plot_ratings_histogram(title, df_col, bins=10, df_col2=None, col1_name='', col2_name='', x_name=''):
if df_col2 is None:
sns.displot(x=df_col, bins=bins, kde=True, color="b", height=5, aspect=2)
plt.xlabel('Average Rating', fontsize=12)
plt.ylabel('Movie Count', fontsize=12)
plt.title(title)
plt.show()
else:
data = [(col1_name, val) for val in df_col.values] + [(col2_name, val) for val in df_col2.values]
data = pd.DataFrame(data=data, columns=['Data', x_name])
p = sns.displot(data=data, x=x_name, hue='Data', kind='kde', fill=True, palette=sns.color_palette('bright')[:2], height=5, aspect=1.5)
p.fig.suptitle(title, fontsize=15)
p.fig.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
average_ratings = get_average_movie_ratings(ratings_df=ratings)
average_ratings.head(10)
| movie_id | avg_rating | title | count | |
|---|---|---|---|---|
| 0 | 0 | 3.878319 | Toy Story (1995) | 452 |
| 1 | 1 | 3.206107 | GoldenEye (1995) | 131 |
| 2 | 2 | 3.033333 | Four Rooms (1995) | 90 |
| 3 | 3 | 3.550239 | Get Shorty (1995) | 209 |
| 4 | 4 | 3.302326 | Copycat (1995) | 86 |
| 5 | 5 | 3.576923 | Shanghai Triad (Yao a yao yao dao waipo qiao) ... | 26 |
| 6 | 6 | 3.798469 | Twelve Monkeys (1995) | 392 |
| 7 | 7 | 3.995434 | Babe (1995) | 219 |
| 8 | 8 | 3.896321 | Dead Man Walking (1995) | 299 |
| 9 | 9 | 3.831461 | Richard III (1995) | 89 |
plot_ratings_histogram(df_col=average_ratings['avg_rating'], title='Movie Ratings')
no_of_ratings_per_movie = ratings.groupby('movie_id')['rating'].count().sort_values(ascending=False)
g = sns.displot(y=no_of_ratings_per_movie, bins=np.ceil(len(no_of_ratings_per_movie) / 20).astype(np.int32), kde=True, color="b", height=6, aspect=3)
plt.title("Distribution of Rating Counts per Movie")
plt.ylabel('Number of Ratings per Movie')
plt.xlabel('Movie Count')
plt.yticks(np.arange(0, max(no_of_ratings_per_movie) + 25, 25))
plt.show()
average_ratings.sort_values(['avg_rating', 'count'], ascending=(False, False), inplace=True)
average_ratings.head(3)
| movie_id | avg_rating | title | count | |
|---|---|---|---|---|
| 1188 | 1188 | 5.0 | Prefontaine (1997) | 3 |
| 1292 | 1292 | 5.0 | Star Kid (1997) | 3 |
| 1466 | 1466 | 5.0 | Saint of Fort Washington, The (1993) | 2 |
user_ratings = pd.merge(users, ratings, on='user_id')
male_ratings = user_ratings.query('sex == "M"')[['movie_id', 'rating']]
average_male_ratings = get_average_movie_ratings(ratings_df=male_ratings)
average_male_ratings.head(10)
| movie_id | avg_rating | title | count | |
|---|---|---|---|---|
| 0 | 0 | 3.909910 | Toy Story (1995) | 333 |
| 1 | 1 | 3.178571 | GoldenEye (1995) | 112 |
| 2 | 2 | 3.108108 | Four Rooms (1995) | 74 |
| 3 | 3 | 3.591463 | Get Shorty (1995) | 164 |
| 4 | 4 | 3.140625 | Copycat (1995) | 64 |
| 5 | 5 | 3.571429 | Shanghai Triad (Yao a yao yao dao waipo qiao) ... | 21 |
| 6 | 6 | 3.861290 | Twelve Monkeys (1995) | 310 |
| 7 | 7 | 3.974843 | Babe (1995) | 159 |
| 8 | 8 | 3.884259 | Dead Man Walking (1995) | 216 |
| 9 | 9 | 3.869565 | Richard III (1995) | 69 |
plot_ratings_histogram(df_col=average_male_ratings['avg_rating'], title='Male Movie Ratings')
average_male_ratings.sort_values(['avg_rating', 'count'], ascending=(False, False), inplace=True)
average_male_ratings.head(3)
| movie_id | avg_rating | title | count | |
|---|---|---|---|---|
| 1290 | 1292 | 5.0 | Star Kid (1997) | 3 |
| 1172 | 1174 | 5.0 | Hugo Pool (1997) | 2 |
| 1186 | 1188 | 5.0 | Prefontaine (1997) | 2 |
female_ratings = user_ratings.query('sex == "F"')[['movie_id', 'rating']]
average_female_ratings = get_average_movie_ratings(ratings_df=female_ratings)
average_female_ratings.head(10)
| movie_id | avg_rating | title | count | |
|---|---|---|---|---|
| 0 | 0 | 3.789916 | Toy Story (1995) | 119 |
| 1 | 1 | 3.368421 | GoldenEye (1995) | 19 |
| 2 | 2 | 2.687500 | Four Rooms (1995) | 16 |
| 3 | 3 | 3.400000 | Get Shorty (1995) | 45 |
| 4 | 4 | 3.772727 | Copycat (1995) | 22 |
| 5 | 5 | 3.600000 | Shanghai Triad (Yao a yao yao dao waipo qiao) ... | 5 |
| 6 | 6 | 3.560976 | Twelve Monkeys (1995) | 82 |
| 7 | 7 | 4.050000 | Babe (1995) | 60 |
| 8 | 8 | 3.927711 | Dead Man Walking (1995) | 83 |
| 9 | 9 | 3.700000 | Richard III (1995) | 20 |
plot_ratings_histogram(df_col=average_female_ratings['avg_rating'], title='Female Movie Ratings')
average_female_ratings.sort_values(['avg_rating', 'count'], ascending=(False, False), inplace=True)
average_female_ratings.head(3)
| movie_id | avg_rating | title | count | |
|---|---|---|---|---|
| 1302 | 1367 | 5.0 | Mina Tannenbaum (1994) | 2 |
| 73 | 73 | 5.0 | Faster Pussycat! Kill! Kill! (1965) | 1 |
| 117 | 118 | 5.0 | Maya Lin: A Strong Clear Vision (1994) | 1 |
male_female_average_ratings = pd.merge(average_male_ratings, average_female_ratings, on='movie_id', suffixes=('_male', '_female'))
male_female_average_ratings.drop('title_male', axis=1, inplace=True)
male_female_average_ratings.rename(columns={'title_female': 'title'}, inplace=True)
male_female_average_ratings['rating_abs_diff'] = (male_female_average_ratings['avg_rating_male'] - male_female_average_ratings['avg_rating_female']).abs()
male_female_average_ratings.sort_values(['rating_abs_diff', 'count_female', 'count_male'], ascending=False, inplace=True)
male_female_average_ratings.head(10)
| movie_id | avg_rating_male | count_male | avg_rating_female | title | count_female | rating_abs_diff | |
|---|---|---|---|---|---|---|---|
| 5 | 1305 | 5.000000 | 1 | 1.0 | Delta of Venus (1994) | 1 | 4.000000 |
| 8 | 850 | 4.666667 | 3 | 1.0 | Two or Three Things I Know About Her (1966) | 1 | 3.666667 |
| 11 | 1428 | 4.500000 | 2 | 1.0 | Sliding Doors (1998) | 2 | 3.500000 |
| 19 | 640 | 4.419355 | 31 | 1.0 | Paths of Glory (1957) | 2 | 3.419355 |
| 53 | 1591 | 4.250000 | 4 | 1.0 | Magic Hour, The (1998) | 1 | 3.250000 |
| 169 | 1274 | 4.000000 | 2 | 1.0 | Killer (Bulletproof Heart) (1994) | 2 | 3.000000 |
| 1433 | 838 | 1.000000 | 2 | 4.0 | Loch Ness (1995) | 2 | 3.000000 |
| 1435 | 1025 | 1.000000 | 2 | 4.0 | Lay of the Land, The (1997) | 2 | 3.000000 |
| 162 | 1557 | 4.000000 | 6 | 1.0 | Aparajito (1956) | 1 | 3.000000 |
| 166 | 625 | 4.000000 | 3 | 1.0 | So Dear to My Heart (1949) | 1 | 3.000000 |
Include only movies whose rater counts are above the mean rater count for both male and female raters.
mean_count_per_movie_male = np.mean(male_female_average_ratings['count_male'])
mean_count_per_movie_female = np.mean(male_female_average_ratings['count_female'])
male_female_average_ratings['above_mean_count'] = (male_female_average_ratings['count_male'] >= mean_count_per_movie_male) & (male_female_average_ratings['count_female'] >= mean_count_per_movie_female)
male_female_average_ratings.sort_values(['above_mean_count', 'rating_abs_diff'], ascending=False, inplace=True)
male_female_average_ratings.head(10)
| movie_id | avg_rating_male | count_male | avg_rating_female | title | count_female | rating_abs_diff | above_mean_count | |
|---|---|---|---|---|---|---|---|---|
| 1090 | 719 | 2.746032 | 63 | 3.782609 | First Knight (1995) | 23 | 1.036577 | True |
| 1073 | 154 | 2.774194 | 62 | 3.666667 | Dirty Dancing (1987) | 36 | 0.892473 | True |
| 47 | 524 | 4.250000 | 52 | 3.476190 | Big Sleep, The (1946) | 21 | 0.773810 | True |
| 1091 | 475 | 2.742574 | 101 | 3.491525 | First Wives Club, The (1996) | 59 | 0.748951 | True |
| 91 | 155 | 4.127119 | 118 | 3.433333 | Reservoir Dogs (1992) | 30 | 0.693785 | True |
| 314 | 692 | 3.791667 | 72 | 3.105263 | Casino (1995) | 19 | 0.686404 | True |
| 981 | 553 | 2.927711 | 83 | 2.263158 | Waterworld (1995) | 19 | 0.664553 | True |
| 440 | 484 | 3.602410 | 83 | 4.238095 | My Fair Lady (1964) | 42 | 0.635686 | True |
| 1190 | 28 | 2.538462 | 91 | 3.173913 | Batman Forever (1995) | 23 | 0.635452 | True |
| 809 | 4 | 3.140625 | 64 | 3.772727 | Copycat (1995) | 22 | 0.632102 | True |
def get_low_high_rated_by_percentile(movie_ratings_df, percentile):
upper_quantile = movie_ratings_df['avg_rating'].quantile(1 - percentile)
lower_quantile = movie_ratings_df['avg_rating'].quantile(percentile)
high_rated_movies = movie_ratings_df.loc[movie_ratings_df['avg_rating'] >= upper_quantile]
low_rated_movies = movie_ratings_df.loc[movie_ratings_df['avg_rating'] <= lower_quantile]
return high_rated_movies, upper_quantile, low_rated_movies, lower_quantile
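On a tiny series the quantile split behaves as follows (the values are illustrative; pandas' default linear interpolation produces the fractional cut points):

```python
import pandas as pd

avg = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])
percentile = 0.2
upper = avg.quantile(1 - percentile)  # 80th percentile -> 4.2
lower = avg.quantile(percentile)      # 20th percentile -> 1.8
high = avg[avg >= upper]              # roughly the top 20% of values
low = avg[avg <= lower]               # roughly the bottom 20% of values
```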
def get_low_high_genere_distibution(movie_ratings_df):
high_rated_movies, high_rating, low_rated_movies, low_rating = get_low_high_rated_by_percentile(movie_ratings_df, percentile=0.2) # 20% highest and 20% lowest
high_rated_dict = {}
low_rated_dict = {}
for genre in genre_cols:
high_rated_genre = high_rated_movies[genre] == 1
low_rated_genre = low_rated_movies[genre] == 1
num_of_row_high_rated = int(high_rated_genre.sum())
num_of_row_low_rated = int(low_rated_genre.sum())
high_rated_dict[genre] = num_of_row_high_rated
low_rated_dict[genre] = num_of_row_low_rated
return (high_rated_dict, high_rating), (low_rated_dict, low_rating)
def plot_low_high_genere_distribution_histogram(low_rated, low_rating, high_rated, high_rating, title):
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(33, 10))
genere_ratings = {'High': (pd.DataFrame(high_rated.items()), high_rating), 'Low': (pd.DataFrame(low_rated.items()), low_rating)}
for i, ax in enumerate(axes):
sns.barplot(x=0, y=1, data=list(genere_ratings.values())[i][0], ax=ax)
for p in ax.patches:
height = p.get_height() # get bar height
ax.text(p.get_x() + p.get_width() / 2,
height + p.get_y(),
f'{int(height)}',
ha = 'center', # horizontal alignment
va = 'bottom') # vertical alignment
ax.set(xlabel = 'Genre', ylabel='Count', title=f'{list(genere_ratings.keys())[i]} Ratings ({list(genere_ratings.values())[i][1]})')
fig.suptitle(f'High/Low Rated Movies By Genre - {title}', fontsize=20)
fig.tight_layout(rect=[0, 0.03, 1, 0.95])
average_ratings_with_genre = pd.merge(average_ratings, movies, on='movie_id')
(high_rated_dict_all, high_rating), (low_rated_dict_all, low_rating) = get_low_high_genere_distibution(average_ratings_with_genre)
plot_low_high_genere_distribution_histogram(low_rated=low_rated_dict_all, low_rating=low_rating, high_rated=high_rated_dict_all, high_rating=high_rating, title='All')
print(f'Highly rated total: (Above or equal {high_rating}): {sum(high_rated_dict_all.values())}')
print(f'Lower rated total: (Below or equal {low_rating}): {sum(low_rated_dict_all.values())}')
Highly rated total: (Above or equal 3.75): 606
Lower rated total: (Below or equal 2.5): 542
average_ratings_with_genre = pd.merge(average_male_ratings, movies, on='movie_id')
(high_rated_dict_all, high_rating), (low_rated_dict_all, low_rating) = get_low_high_genere_distibution(average_ratings_with_genre)
plot_low_high_genere_distribution_histogram(low_rated=low_rated_dict_all, low_rating=low_rating, high_rated=high_rated_dict_all, high_rating=high_rating, title='Male')
print(f'Highly rated total: (Above or equal {high_rating}): {sum(high_rated_dict_all.values())}')
print(f'Lower rated total: (Below or equal {low_rating}): {sum(low_rated_dict_all.values())}')
Highly rated total: (Above or equal 3.8333333333333335): 557
Lower rated total: (Below or equal 2.5): 517
average_ratings_with_genre = pd.merge(average_female_ratings, movies, on='movie_id')
(high_rated_dict_all, high_rating), (low_rated_dict_all, low_rating) = get_low_high_genere_distibution(average_ratings_with_genre)
plot_low_high_genere_distribution_histogram(low_rated=low_rated_dict_all, low_rating=low_rating, high_rated=high_rated_dict_all, high_rating=high_rating, title='Female')
print(f'Highly rated total: (Above or equal {high_rating}): {sum(high_rated_dict_all.values())}')
print(f'Lower rated total: (Below or equal {low_rating}): {sum(low_rated_dict_all.values())}')
Highly rated total: (Above or equal 3.7681574239713775): 539
Lower rated total: (Below or equal 2.5): 523
sns.displot(x=users['age'], bins=25, kde=True, color="b", height=5, aspect=2)
plt.title("Distribution of User Ages")
plt.ylabel('Number of Users')
plt.xlabel('Age')
plt.show()
ages = user_ratings['age']
max_age = max(ages)
min_age = min(ages)
age_groups = {'kids': (min_age, 13), 'teens': (14, 21), 'adults': (22, 65), 'elder': (66, max_age)}
kids_ratings = user_ratings.query(f'age >= {age_groups["kids"][0]} and age <= {age_groups["kids"][1]}')[['movie_id', 'rating']]
average_kids_ratings = get_average_movie_ratings(ratings_df=kids_ratings)
teens_ratings = user_ratings.query(f'age >= {age_groups["teens"][0]} and age <= {age_groups["teens"][1]}')[['movie_id', 'rating']]
average_teens_ratings = get_average_movie_ratings(ratings_df=teens_ratings)
adults_ratings = user_ratings.query(f'age >= {age_groups["adults"][0]} and age <= {age_groups["adults"][1]}')[['movie_id', 'rating']]
average_adults_ratings = get_average_movie_ratings(ratings_df=adults_ratings)
elder_ratings = user_ratings.query(f'age >= {age_groups["elder"][0]} and age <= {age_groups["elder"][1]}')[['movie_id', 'rating']]
average_elder_ratings = get_average_movie_ratings(ratings_df=elder_ratings)
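The four near-identical query-and-average blocks above can also be written as a single pass with `pd.cut`, binning each row by the rater's age. A sketch under the same group boundaries (the bin edges restate `age_groups`; `200` is just a generous upper bound, and the `ages` values are made up):

```python
import pandas as pd

ages = pd.Series([10, 16, 30, 70])
# Right-closed bins: (0, 13] kids, (13, 21] teens, (21, 65] adults, (65, 200] elder
labels = pd.cut(ages, bins=[0, 13, 21, 65, 200],
                labels=['kids', 'teens', 'adults', 'elder'])
```

With a column such as `user_ratings['age_group'] = pd.cut(user_ratings['age'], ...)`, a single `groupby` on the new column could replace the four separate queries.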
average_ratings_with_genre = pd.merge(average_kids_ratings, movies, on='movie_id')
(high_rated_dict_all, high_rating), (low_rated_dict_all, low_rating) = get_low_high_genere_distibution(average_ratings_with_genre)
plot_low_high_genere_distribution_histogram(low_rated=low_rated_dict_all, low_rating=low_rating, high_rated=high_rated_dict_all, high_rating=high_rating, title=f'Kids {age_groups["kids"][0]} to {age_groups["kids"][1]}')
print(f'Highly rated total: (Above or equal {high_rating}): {sum(high_rated_dict_all.values())}')
print(f'Lower rated total: (Below or equal {low_rating}): {sum(low_rated_dict_all.values())}')
Highly rated total: (Above or equal 4.0): 469
Lower rated total: (Below or equal 3.0): 422
average_ratings_with_genre = pd.merge(average_teens_ratings, movies, on='movie_id')
(high_rated_dict_all, high_rating), (low_rated_dict_all, low_rating) = get_low_high_genere_distibution(average_ratings_with_genre)
plot_low_high_genere_distribution_histogram(low_rated=low_rated_dict_all, low_rating=low_rating, high_rated=high_rated_dict_all, high_rating=high_rating, title=f'Teens {age_groups["teens"][0]} to {age_groups["teens"][1]}')
print(f'Highly rated total: (Above or equal {high_rating}): {sum(high_rated_dict_all.values())}')
print(f'Lower rated total: (Below or equal {low_rating}): {sum(low_rated_dict_all.values())}')
Highly rated total: (Above or equal 4.0): 523
Lower rated total: (Below or equal 2.5): 454
average_ratings_with_genre = pd.merge(average_adults_ratings, movies, on='movie_id')
(high_rated_dict_all, high_rating), (low_rated_dict_all, low_rating) = get_low_high_genere_distibution(average_ratings_with_genre)
plot_low_high_genere_distribution_histogram(low_rated=low_rated_dict_all, low_rating=low_rating, high_rated=high_rated_dict_all, high_rating=high_rating, title=f'Adults {age_groups["adults"][0]} to {age_groups["adults"][1]}')
print(f'Highly rated total: (Above or equal {high_rating}): {sum(high_rated_dict_all.values())}')
print(f'Lower rated total: (Below or equal {low_rating}): {sum(low_rated_dict_all.values())}')
Highly rated total: (Above or equal 3.772803666921314): 584
Lower rated total: (Below or equal 2.5): 547
average_ratings_with_genre = pd.merge(average_elder_ratings, movies, on='movie_id')
(high_rated_dict_all, high_rating), (low_rated_dict_all, low_rating) = get_low_high_genere_distibution(average_ratings_with_genre)
plot_low_high_genere_distribution_histogram(low_rated=low_rated_dict_all, low_rating=low_rating, high_rated=high_rated_dict_all, high_rating=high_rating, title=f'Elder {age_groups["elder"][0]} to {age_groups["elder"][1]}')
print(f'Highly rated total: (Above or equal {high_rating}): {sum(high_rated_dict_all.values())}')
print(f'Lower rated total: (Below or equal {low_rating}): {sum(low_rated_dict_all.values())}')
Highly rated total: (Above or equal 4.0): 330
Lower rated total: (Below or equal 3.0): 254
The formula used by IMDB for its Top Rated movies list gives a true Bayesian estimate:
weighted rating (WR) = (v ÷ (v+m)) × R + (m ÷ (v+m)) × C
where v is the number of ratings for the movie, m is the minimum rating count required to qualify, R is the movie's average rating, and C is the mean rating across all movies.
rating_counts = ratings.groupby('movie_id', as_index=False).agg(rating_count=('rating', 'count'))
m = round(rating_counts['rating_count'].quantile(0.9)) # m is the 90th percentile of per-movie rating counts
C = average_ratings['avg_rating'].mean() # Mean rating across all movies
def weighted_rating(row):
v = row['rating_count']
R = row['avg_rating'] # rating average per movie
return (v / (v + m) * R) + (m / (m + v) * C)
# Keep only movies whose rating count is at least the threshold m
qualified = pd.merge(average_ratings, rating_counts[rating_counts['rating_count'] >= m], on='movie_id')
# Calculate popularity rating for each movie of the qualified movies
qualified['popularity'] = qualified.apply(weighted_rating, axis=1)
qualified.sort_values('popularity', ascending=False, inplace=True)
qualified
| movie_id | avg_rating | title | count | rating_count | popularity | |
|---|---|---|---|---|---|---|
| 5 | 49 | 4.358491 | Star Wars (1977) | 583 | 583 | 4.070281 |
| 0 | 317 | 4.466443 | Schindler's List (1993) | 298 | 298 | 3.963279 |
| 2 | 63 | 4.445230 | Shawshank Redemption, The (1994) | 283 | 283 | 3.933300 |
| 11 | 126 | 4.283293 | Godfather, The (1972) | 413 | 413 | 3.932735 |
| 9 | 97 | 4.289744 | Silence of the Lambs, The (1991) | 390 | 390 | 3.922811 |
| ... | ... | ... | ... | ... | ... | ... |
| 165 | 545 | 3.031496 | Broken Arrow (1996) | 254 | 254 | 3.049294 |
| 166 | 288 | 2.980695 | Evita (1996) | 259 | 259 | 3.018345 |
| 167 | 322 | 2.933333 | Dante's Peak (1997) | 240 | 240 | 2.992302 |
| 168 | 234 | 2.847926 | Mars Attacks! (1996) | 217 | 217 | 2.947802 |
| 169 | 677 | 2.808219 | Volcano (1997) | 219 | 219 | 2.924875 |
170 rows × 6 columns
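As a sanity check on the weighted-rating formula, a toy computation with made-up numbers (the v, m, R, C values below are illustrative, not taken from the dataset):

```python
v, m, R, C = 100, 50, 4.5, 3.5  # votes, qualification threshold, movie mean, global mean
wr = (v / (v + m)) * R + (m / (v + m)) * C
# With twice as many votes as the threshold, the movie keeps 2/3 of its own mean
# and is pulled 1/3 of the way toward the global mean.
```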
sparsity = 1 - len(ratings) / (len(users) * len(movies))
print(f'Sparsity: {sparsity}')
print(f'Sparsity %: {(sparsity * 100):.4f}%')
Sparsity: 0.9369533063577546
Sparsity %: 93.6953%
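The printed sparsity can be reproduced directly from the dataset summary above (943 users, 1682 movies, 100000 ratings):

```python
users_n, movies_n, ratings_n = 943, 1682, 100000  # counts from the dataset summary
# Fraction of the user x movie matrix that carries no rating
sparsity = 1 - ratings_n / (users_n * movies_n)
```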
Observe the most active users and their rating counts
unique_users = len(users['user_id'].unique())
unique_rating_users = len(ratings['user_id'].unique())
# Test that each user in the dataset rated at least one or more movies
assert unique_users == unique_rating_users
no_of_rated_movies_per_user = ratings.groupby(by='user_id')['rating'].count().sort_values(ascending=False)
no_of_rated_movies_per_user.head(10)
user_id
404    737
654    685
12     636
449    540
275    518
415    493
536    490
302    484
233    480
392    448
Name: rating, dtype: int64
Calculate the mean rating count per user over the entire dataset
print(no_of_rated_movies_per_user.describe())
ratings_per_user_avg = round(ratings.groupby('user_id').agg('size').mean())
print(f'\nAverage rating count per user: {ratings_per_user_avg}')
count    943.000000
mean     106.044539
std      100.931743
min       20.000000
25%       33.000000
50%       65.000000
75%      148.000000
max      737.000000
Name: rating, dtype: float64

Average rating count per user: 106
user_rating_count = ratings.groupby('user_id')['rating'].count()
top_users = user_rating_count.sort_values(ascending=False)[:15] # 15 Most rating users
movie_rating_count = ratings.groupby('movie_id')['rating'].count()
top_movies = movie_rating_count.sort_values(ascending=False)[:15] # 15 Most rated movies
# Attach movie id to movie title
movie_titles = pd.merge(top_movies, movies, on='movie_id')
top_r = ratings.join(top_users, rsuffix='_r', how='inner', on='user_id')
top_r = top_r.join(movie_titles, rsuffix='_r', how='inner', on='movie_id')
pd.crosstab(top_r.user_id, top_r.title, top_r.rating, aggfunc=np.sum).fillna('-')
| title | Air Force One (1997) | Contact (1997) | English Patient, The (1996) | Fargo (1996) | Godfather, The (1972) | Independence Day (ID4) (1996) | Liar Liar (1997) | Pulp Fiction (1994) | Raiders of the Lost Ark (1981) | Return of the Jedi (1983) | Scream (1996) | Silence of the Lambs, The (1991) | Star Wars (1977) | Toy Story (1995) | Twelve Monkeys (1995) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| user_id | |||||||||||||||
| 6 | 5 | - | - | - | 5 | 4 | - | - | 3 | 5 | 5 | - | - | 5 | - |
| 12 | 3 | 3 | - | - | 5 | - | 1 | 5 | 1 | 5 | 2 | - | 3 | 4 | 4 |
| 180 | 4 | - | 1 | 2 | - | 2 | - | 2 | - | - | 4 | 3 | 3 | - | 1 |
| 233 | 3 | 2 | - | - | 1 | 3 | 3 | 3 | 2 | 4 | 2 | 3 | 3 | 5 | 3 |
| 275 | 5 | 4 | - | 3 | 5 | - | 3 | - | 5 | 4 | 5 | - | 5 | 4 | 4 |
| 278 | - | 4 | - | - | 2 | 4 | - | 3 | - | 4 | 5 | - | 3 | - | - |
| 302 | 5 | 3 | - | 3 | 4 | - | 2 | 4 | 4 | 4 | 4 | 3 | 5 | 5 | - |
| 392 | 4 | 4 | - | 3 | 5 | - | 3 | - | 3 | 4 | 4 | 3 | 3 | 3 | - |
| 404 | - | 1 | - | - | 5 | - | 4 | - | 4 | 4 | - | - | - | 4 | - |
| 415 | 5 | 4 | - | - | 5 | 3 | - | 5 | 4 | 4 | 4 | 4 | 5 | 5 | 4 |
| 428 | - | 3 | - | 2 | 5 | - | - | - | 4 | 4 | 2 | 5 | 3 | 3 | - |
| 449 | - | 4 | - | 4 | 4 | 4 | - | 3 | 5 | 3 | 4 | 3 | 4 | - | - |
| 536 | - | - | 2 | 2 | 3 | 4 | - | 4 | 3 | 2 | 4 | 3 | 2 | - | 4 |
| 654 | 3 | 3 | 4 | - | 3 | - | 2 | 3 | 2 | 2 | 3 | 3 | 2 | 3 | 3 |
| 845 | - | 5 | - | - | 5 | - | - | - | 5 | 5 | - | - | - | 4 | - |
def plot_model_loss_comparison(losses, plot_title, high=False):
'''Plots model MAE and RMSE loss values by the provided dict'''
mae = {k: s['mae'] for k, s in losses.items()}
rmse = {k: s['rmse'] for k, s in losses.items()}
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(len(losses) * 3, 8 if high else 6))
mae_keys = list(mae.keys())
mae_vals = [mae[k] for k in mae_keys]
p1 = sns.barplot(x=mae_keys, y=mae_vals, ax=axes[0], palette="hls")
axes[0].set_title('MAE')
axes[0].set_ylabel('Loss')
axes[0].set_xlabel('Model')
p1.set(yticks=np.arange(0, max(mae_vals) + 0.2, 0.2))
axes[0].tick_params(axis='y', labelsize=10)
for i, v in enumerate(mae_vals):
axes[0].text(i-.25, 0.1, f'{mae_vals[i]:.3f}', fontsize=14)
plt.setp(axes[0].get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
axes[0].plot()
rmse_keys = list(rmse.keys())
rmse_vals = [rmse[k] for k in rmse_keys]
p2 = sns.barplot(x=rmse_keys, y=rmse_vals, ax=axes[1], palette="hls")
axes[1].set_title('RMSE')
axes[1].set_ylabel('Loss')
axes[1].set_xlabel('Model')
p2.set(yticks=np.arange(0, max(rmse_vals) + 0.2, 0.2))
axes[1].tick_params(axis='y', labelsize=10)
for i, v in enumerate(rmse_vals):
axes[1].text(i-.25, 0.1, f'{rmse_vals[i]:.3f}', fontsize=14)
plt.setp(axes[1].get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
axes[1].plot()
fig.suptitle(plot_title + ' (Lower is better)')
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
def plot_model_train_time_comparison(times, plot_title, high=False):
'''Plots model train time values by the provided dict'''
time_keys = list(times.keys())
time_vals = [times[t] for t in time_keys]
p1 = sns.barplot(x=time_keys, y=time_vals, palette="Blues")
# p1.set(yticks=np.arange(0, max(time_vals) + 0.2, 0.2))
ax = p1.axes
ax.tick_params(axis='y', labelsize=10)
for i, v in enumerate(time_vals):
ax.text(i-.25, 0.1, f'{time_vals[i]:.2f}s', fontsize=14)
plt.setp(ax.get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
ax.plot()
plt.xlabel('Model', fontsize=12)
plt.ylabel('Training time', fontsize=12)
plt.title(plot_title)
plt.show()
Precision at k is the proportion of recommended items in the top-k set that are relevant.
Recall at k is the proportion of relevant items found in the top-k recommendations.
def precision_recall_at_k(data, k=5, threshold=3.5):
"""Return precision and recall at k metrics for each user"""
# First map the predictions to each user.
user_est_true = defaultdict(list)
for uid, true_pred in data:
for true_r, pred_r in true_pred:
user_est_true[uid].append((pred_r, true_r))
precisions = dict()
recalls = dict()
for uid, user_ratings in user_est_true.items():
# Sort user ratings by predicted value
user_ratings.sort(key=lambda x: x[0], reverse=True)
# Number of relevant items
n_rel = sum((true_r >= threshold) for (_, true_r) in user_ratings)
# Number of recommended items in top k
n_rec_k = sum((pred_r >= threshold) for (pred_r, _) in user_ratings[:k])
# Number of relevant and recommended items in top k
n_rel_and_rec_k = sum(((true_r >= threshold) and (pred_r >= threshold))
for (pred_r, true_r) in user_ratings[:k])
# Precision@K: Proportion of recommended items that are relevant
# When n_rec_k is 0, Precision is undefined. We here set it to 0.
precisions[uid] = n_rel_and_rec_k / n_rec_k if n_rec_k != 0 else 0
# Recall@K: Proportion of relevant items that are recommended
# When n_rel is 0, Recall is undefined. We here set it to 0.
recalls[uid] = n_rel_and_rec_k / n_rel if n_rel != 0 else 0
return precisions, recalls
def mean_precision_recall_at_k(test_pred, k=5, threshold=3.5):
"""Return the mean precision and recall at k metrics for all users"""
data = []
for uid, vals in test_pred.sort_values('user_id', ascending=False).groupby('user_id'):
data.append((uid, zip(list(vals['rating'].values), list(vals['rating_pred'].values))))
precisions, recalls = precision_recall_at_k(data, k=k, threshold=threshold)
return np.mean(list(precisions.values())), np.mean(list(recalls.values()))
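A worked toy example of the metric for a single user, using the function's default threshold of 3.5 and k=2 (the prediction/rating pairs are made up; the counting below mirrors the logic of `precision_recall_at_k`):

```python
threshold, k = 3.5, 2
# (predicted, true) rating pairs for one user
user_ratings = [(4.8, 5.0), (4.0, 2.0), (3.0, 4.0)]
user_ratings.sort(key=lambda x: x[0], reverse=True)  # sort by predicted value

n_rel = sum(true_r >= threshold for _, true_r in user_ratings)        # 2 relevant items
n_rec_k = sum(pred_r >= threshold for pred_r, _ in user_ratings[:k])  # 2 recommended in top-2
n_rel_and_rec_k = sum(true_r >= threshold and pred_r >= threshold
                      for pred_r, true_r in user_ratings[:k])         # 1 hit

precision = n_rel_and_rec_k / n_rec_k if n_rec_k else 0  # 1/2
recall = n_rel_and_rec_k / n_rel if n_rel else 0         # 1/2
```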
Here we assume that a relevant movie is one with a rating of 4 or above (both actual and predicted).
def plot_model_precision_recall_at_k(test_pred, k, plot_title, threshold=4):
'''Plots model precision@k and recall@k values by the provided df and k values with given threshold'''
mprk = [(i, mean_precision_recall_at_k(test_pred, i, threshold)) for i in k]
mprk = [(k, precision, recall) for k, (precision, recall) in mprk]
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(len(mprk) * 3, 6))
keys = [f'Precision@{k}' for k, precision, recall in mprk]
vals = [precision for k, precision, recall in mprk]
p1 = sns.barplot(x=keys, y=vals, ax=axes[0], palette="YlOrBr")
axes[0].set_title('Precision@K')
axes[0].set_ylabel('Precision@K')
axes[0].set_xlabel('K')
p1.set(yticks=np.arange(0, max(vals) + 0.2, 0.2))
axes[0].tick_params(axis='y', labelsize=10)
for i, v in enumerate(vals):
axes[0].text(i-.25, 0.1, f'{vals[i]:.3f}', fontsize=14)
plt.setp(axes[0].get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
axes[0].plot()
keys = [f'Recall@{k}' for k, precision, recall in mprk]
vals = [recall for k, precision, recall in mprk]
p2 = sns.barplot(x=keys, y=vals, ax=axes[1], palette="YlOrBr")
axes[1].set_title('Recall@K')
axes[1].set_ylabel('Recall@K')
axes[1].set_xlabel('K')
p2.set(yticks=np.arange(0, max(vals) + 0.2, 0.2))
axes[1].tick_params(axis='y', labelsize=10)
for i, v in enumerate(vals):
axes[1].text(i-.25, 0.1, f'{vals[i]:.3f}', fontsize=14)
plt.setp(axes[1].get_xticklabels(), rotation=45, ha="right", rotation_mode="anchor")
axes[1].plot()
fig.suptitle(plot_title + " - Precision/Recall@K")
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
return mprk
def plot_model_precision_recall_at_k_comparison(data, plot_title, high=False, wide=False):
'''Plots models precision@k and recall@k values as provided in given data dict'''
mprk = [(name, i) for name, s in data.items() for i in s]
mprk = [(name, t, k, precision) if t == 'Precision@K' else (name, t, k, recall) for name, (k, precision, recall) in mprk for t in ['Precision@K', 'Recall@K']]
df = pd.DataFrame(mprk, columns = ['Name', 'Type', 'K', 'Value'])
p = sns.catplot(x='K', y='Value', col='Type', hue='Name', data=df, kind='bar', sharey=False, aspect=1 if not wide else 3, height=5, palette="muted")
p.fig.suptitle(plot_title + ' - Precision/Recall @ K Comparison', y=1.0)
p.fig.subplots_adjust(top=0.81,right=0.86)
p._legend.remove()
for i in range(2):
ax = p.facet_axis(0,i)
for patch in ax.patches:
ax.text(patch.get_x() + 0.01, patch.get_height() * 1.02,
f'{patch.get_height():.3}',
color='black',
rotation='horizontal',
size='small')
plt.legend(bbox_to_anchor=(1.05, 1), loc=2, borderaxespad=0.)
plt.show()
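The metric behind these plots can be sketched in a few lines. This is a toy illustration, not the notebook's actual helper: the function name `precision_recall_at_k` and the relevance threshold of 3.5 are assumptions made here for the example.

```python
import numpy as np

def precision_recall_at_k(y_true, y_pred, k, threshold=3.5):
    # Rank one user's items by predicted rating and keep the top-k.
    top_k = np.argsort(y_pred)[::-1][:k]
    # An item counts as "relevant" if its true rating clears the (assumed) threshold.
    relevant = y_true >= threshold
    hits = np.sum(relevant[top_k])
    precision = hits / k                      # share of recommended items that are relevant
    recall = hits / max(np.sum(relevant), 1)  # share of relevant items that were recommended
    return precision, recall

p, r = precision_recall_at_k(np.array([5, 3, 4, 2, 5]),
                             np.array([4.2, 3.1, 4.8, 2.0, 3.9]), k=2)
# top-2 by predicted rating are items 2 and 0; both have true rating >= 3.5
```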
# Load train and test data
r_cols = ['user_id', 'movie_id', 'rating', 'unix_timestamp']
ratings_train = pd.read_csv(get_file_path('u1.base'), sep='\t', names=r_cols, encoding='latin-1')
ratings_train['movie_id'] = decrement_ids('movie_id', ratings_train)
ratings_train['user_id'] = decrement_ids('user_id', ratings_train)
ratings_test = pd.read_csv(get_file_path('u1.test'), sep='\t', names=r_cols, encoding='latin-1')
ratings_test['movie_id'] = decrement_ids('movie_id', ratings_test)
ratings_test['user_id'] = decrement_ids('user_id', ratings_test)
print(f'Train data shape: {ratings_train.shape}, Test data shape: {ratings_test.shape}')
Train data shape: (80000, 4), Test data shape: (20000, 4)
ratings_train
| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 0 | 0 | 5 | 874965758 |
| 1 | 0 | 1 | 3 | 876893171 |
| 2 | 0 | 2 | 4 | 878542960 |
| 3 | 0 | 3 | 3 | 876893119 |
| 4 | 0 | 4 | 3 | 889751712 |
| ... | ... | ... | ... | ... |
| 79995 | 942 | 1066 | 2 | 875501756 |
| 79996 | 942 | 1073 | 4 | 888640250 |
| 79997 | 942 | 1187 | 3 | 888640250 |
| 79998 | 942 | 1227 | 3 | 888640275 |
| 79999 | 942 | 1329 | 3 | 888692465 |
80000 rows × 4 columns
ratings_test
| | user_id | movie_id | rating | unix_timestamp |
|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 |
| 1 | 0 | 9 | 3 | 875693118 |
| 2 | 0 | 11 | 5 | 878542960 |
| 3 | 0 | 13 | 5 | 874965706 |
| 4 | 0 | 16 | 3 | 875073198 |
| ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 |
| 19996 | 457 | 1100 | 4 | 886397931 |
| 19997 | 458 | 933 | 3 | 879563639 |
| 19998 | 459 | 9 | 3 | 882912371 |
| 19999 | 461 | 681 | 5 | 886365231 |
20000 rows × 4 columns
plot_ratings_histogram(df_col=ratings_train['rating'], df_col2=ratings_test['rating'], title='Train/Test rating distribution comparison', col1_name='Train', col2_name='Test', x_name='Rating')
def plot_gender_histogram(title):
data = [('Train', 1 if val == 'M' else 0) for val in pd.merge(ratings_train, users, on='user_id')['sex'].values] + [('Test', 1 if val == 'M' else 0) for val in pd.merge(ratings_test, users, on='user_id')['sex'].values]
data = pd.DataFrame(data=data, columns=['Data', 'Gender'])
p = sns.displot(data=data, x='Gender', hue='Data', kind='kde', fill=True, palette=sns.color_palette('bright')[:2], height=5, aspect=1.5)
p.set_xticklabels(['Female', 'Male'], fontsize=10)
p.fig.suptitle(title, fontsize=15)
p.fig.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.xticks(np.arange(0, 2, 1.0))
plt.show()
plot_gender_histogram(title='Train/Test Gender distribution')
def plot_age_histogram(title):
data = [('Train', val) for val in pd.merge(ratings_train, users, on='user_id')['age'].values] + [('Test', val) for val in pd.merge(ratings_test, users, on='user_id')['age'].values]
data = pd.DataFrame(data=data, columns=['Data', 'Age'])
p = sns.displot(data=data, x='Age', hue='Data', kind='kde', fill=True, palette=sns.color_palette('bright')[:2], height=5, aspect=1.5)
plt.title(title)
plt.show()
plot_age_histogram(title='Train/Test Age distribution')
def mae(y_true, y_pred):
"""Calculates MAE metric according to true and predicted values"""
return np.mean(np.abs(y_true - y_pred))
def rmse(y_true, y_pred):
"""Calculates RMSE metric according to true and predicted values"""
return np.sqrt(np.mean((y_pred - y_true)**2))
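A quick sanity check of the two metrics on toy arrays (values chosen arbitrarily): because RMSE squares the errors before averaging, the single large error (|2 − 3| = 1) pulls it above the MAE.

```python
import numpy as np

y_true = np.array([3, 4, 5, 2])
y_pred = np.array([2.5, 4.0, 4.5, 3.0])

mae_val = np.mean(np.abs(y_true - y_pred))           # (0.5 + 0 + 0.5 + 1) / 4 = 0.5
rmse_val = np.sqrt(np.mean((y_pred - y_true) ** 2))  # sqrt(1.5 / 4) ≈ 0.612
```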
# The baseline model always predicts the average rating per movie
class BaselineRecommender:
def __init__(self, gender=None):
self.gender = gender
def fit(self, data):
if self.gender:
self.data = user_ratings.query(f'sex == "{self.gender}"')[['movie_id', 'rating']]  # user_ratings: ratings merged with users, defined earlier in the notebook
else:
self.data = data
self.data = get_average_movie_ratings(self.data)
self.model = pd.Series(self.data['avg_rating'].values, index=self.data['movie_id']).to_dict()
def predict(self, test_data):
pred = []
if self.gender:
test_data = pd.merge(users, test_data, on='user_id').query(f'sex == "{self.gender}"')[['user_id', 'movie_id', 'rating']]
else:
test_data = test_data[['user_id', 'movie_id', 'rating']]
pred = [self.model[movie_id] for movie_id in test_data['movie_id']]
return pred, test_data
def evaluate(self, test_data):
pred, test_data = self.predict(test_data)
return {'MAE': mae(test_data["rating"], pred), 'RMSE': rmse(test_data["rating"], pred)}
baseline_model = BaselineRecommender()
baseline_model.fit(ratings)
basemodel_pred, test_data = baseline_model.predict(ratings_test)
basemodel_loss = baseline_model.evaluate(test_data)
test_pred = test_data.assign(rating_pred=basemodel_pred)
print(f'Total subjects in test data: {len(test_pred)}')
test_pred
Total subjects in test data: 20000
| | user_id | movie_id | rating | rating_pred |
|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 3.576923 |
| 1 | 0 | 9 | 3 | 3.831461 |
| 2 | 0 | 11 | 5 | 4.385768 |
| 3 | 0 | 13 | 5 | 3.967213 |
| 4 | 0 | 16 | 3 | 3.119565 |
| ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 4.029851 |
| 19996 | 457 | 1100 | 4 | 3.770270 |
| 19997 | 458 | 933 | 3 | 2.926471 |
| 19998 | 459 | 9 | 3 | 3.831461 |
| 19999 | 461 | 681 | 5 | 3.060000 |
20000 rows × 4 columns
mprk_baseline_all = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='Baseline All')
print(f'MAE: {basemodel_loss["MAE"]}')
print(f'RMSE: {basemodel_loss["RMSE"]}')
MAE: 0.8072284653453374
RMSE: 1.0083009538696794
gender = 'M'
baseline_model_male = BaselineRecommender(gender)
baseline_model_male.fit(ratings)
basemodel_pred_male, test_data = baseline_model_male.predict(ratings_test)
basemodel_loss_male = baseline_model_male.evaluate(test_data)
test_pred = test_data.assign(rating_pred=basemodel_pred_male)
print(f'Total {gender} gender in test data: {len(test_pred)}')
test_pred
Total M gender in test data: 15167
| | user_id | movie_id | rating | rating_pred |
|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 3.571429 |
| 1 | 0 | 9 | 3 | 3.869565 |
| 2 | 0 | 11 | 5 | 4.399061 |
| 3 | 0 | 13 | 5 | 4.000000 |
| 4 | 0 | 16 | 3 | 3.164557 |
| ... | ... | ... | ... | ... |
| 19985 | 455 | 942 | 4 | 3.513514 |
| 19994 | 457 | 143 | 4 | 3.908163 |
| 19995 | 457 | 647 | 4 | 4.060000 |
| 19996 | 457 | 1100 | 4 | 3.816667 |
| 19997 | 458 | 933 | 3 | 2.763158 |
15167 rows × 4 columns
mprk_baseline_male = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='Baseline Male')
print(f'MAE: {basemodel_loss_male["MAE"]}')
print(f'RMSE: {basemodel_loss_male["RMSE"]}')
MAE: 0.7807317165450502
RMSE: 0.9767495515684287
gender = 'F'
baseline_model_female = BaselineRecommender(gender)
baseline_model_female.fit(ratings)
basemodel_pred_female, test_data = baseline_model_female.predict(ratings_test)
basemodel_loss_female = baseline_model_female.evaluate(test_data)
test_pred = test_data.assign(rating_pred=basemodel_pred_female)
print(f'Total {gender} gender in test data: {len(test_pred)}')
test_pred
Total F gender in test data: 4833
| | user_id | movie_id | rating | rating_pred |
|---|---|---|---|---|
| 137 | 1 | 12 | 4 | 3.269231 |
| 138 | 1 | 18 | 3 | 4.285714 |
| 139 | 1 | 49 | 5 | 4.245033 |
| 140 | 1 | 250 | 5 | 4.529412 |
| 141 | 1 | 256 | 4 | 3.650000 |
| ... | ... | ... | ... | ... |
| 19991 | 456 | 703 | 4 | 2.818182 |
| 19992 | 456 | 707 | 4 | 3.566667 |
| 19993 | 456 | 774 | 3 | 2.933333 |
| 19998 | 459 | 9 | 3 | 3.700000 |
| 19999 | 461 | 681 | 5 | 3.187500 |
4833 rows × 4 columns
mprk_baseline_female = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='Baseline Female')
print(f'MAE: {basemodel_loss_female["MAE"]}')
print(f'RMSE: {basemodel_loss_female["RMSE"]}')
MAE: 0.8527262621582588
RMSE: 1.0601130761239335
plot_model_loss_comparison(losses={'Baseline All': {'mae': basemodel_loss["MAE"], 'rmse': basemodel_loss["RMSE"]},
'Baseline Male': {'mae': basemodel_loss_male["MAE"], 'rmse': basemodel_loss_male["RMSE"]},
'Baseline Female': {'mae': basemodel_loss_female["MAE"], 'rmse': basemodel_loss_female["RMSE"]}},
plot_title='Baseline model comparison')
plot_model_precision_recall_at_k_comparison(data={'Baseline All': mprk_baseline_all,
'Baseline Male': mprk_baseline_male,
'Baseline Female': mprk_baseline_female},
plot_title='Baseline')
The TuriCreate recommender toolkit provides a unified interface to train a variety of recommender models and use them to make recommendations.
train_data = tc.SFrame(ratings_train)
test_data = tc.SFrame(ratings_test)
Implement rating prediction using the following models:
Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space: the cosine of the angle between them, which equals the inner product of the two vectors after normalizing each to unit length.
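In NumPy this reduces to a couple of lines; a toy check on two items' rating vectors across the same four users (not TuriCreate's internal implementation — unrated entries are simply 0 here):

```python
import numpy as np

def cosine_similarity(a, b):
    # cos(theta) = <a, b> / (||a|| * ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([5.0, 3.0, 0.0, 4.0])  # ratings of item A by users 0..3 (0 = unrated)
b = np.array([4.0, 0.0, 0.0, 5.0])  # ratings of item B by the same users
sim = cosine_similarity(a, b)
```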
item_cosine_sim_model = tc.item_similarity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating', similarity_type='cosine')
Warning: Ignoring columns unix_timestamp;
To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
Data has 80000 observations with 943 users and 1650 items.
Data prepared in: 0.120485s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 13.912ms | 100 |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 18.979ms | 0 | 2 |
| 262.353ms | 100 | 1650 |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 0.295729s
y_pred_cosine = item_cosine_sim_model.predict(test_data)
print(f'MAE: {mae(test_data["rating"], y_pred_cosine)}')
print(f'RMSE: {rmse(np.array(test_data["rating"]), np.array(y_pred_cosine))}')
item_cosine_sim_model.evaluate(test_data)
MAE: 3.2668090662272924
RMSE: 3.4523759469176674

Precision and recall summary statistics by cutoff
+--------+---------------------+---------------------+
| cutoff | mean_precision      | mean_recall         |
+--------+---------------------+---------------------+
| 1      | 0.5904139433551199  | 0.02627655635316414 |
| 2      | 0.5555555555555555  | 0.04729886233977544 |
| 3      | 0.5301379811183732  | 0.06427629156051512 |
| 4      | 0.5054466230936822  | 0.0819837493495335  |
| 5      | 0.49673202614379114 | 0.10107005493937378 |
| 6      | 0.47712418300653575 | 0.11393808710889497 |
| 7      | 0.4593837535014004  | 0.12560420169882275 |
| 8      | 0.44498910675381265 | 0.13513370499162625 |
| 9      | 0.4366981360445412  | 0.1467042228785395  |
| 10     | 0.4270152505446624  | 0.1575620140381631  |
+--------+---------------------+---------------------+
[10 rows x 3 columns]

Overall RMSE: 3.4523759469176682

Per User RMSE (best)
+---------+-------------------+-------+
| user_id | rmse              | count |
+---------+-------------------+-------+
| 180     | 1.646567230606569 | 217   |
+---------+-------------------+-------+
[1 rows x 3 columns]

Per User RMSE (worst)
+---------+-------------------+-------+
| user_id | rmse              | count |
+---------+-------------------+-------+
| 461     | 4.599920243024826 | 1     |
+---------+-------------------+-------+
[1 rows x 3 columns]

Per Item RMSE (best)
+----------+--------------------+-------+
| movie_id | rmse               | count |
+----------+--------------------+-------+
| 438      | 0.9441616466314433 | 2     |
+----------+--------------------+-------+
[1 rows x 3 columns]

Per Item RMSE (worst)
+----------+------+-------+
| movie_id | rmse | count |
+----------+------+-------+
| 1151     | 5.0  | 1     |
+----------+------+-------+
[1 rows x 3 columns]
{'precision_recall_by_user': Columns:
user_id int
cutoff int
precision float
recall float
count int
Rows: 8262
Data:
+---------+--------+--------------------+-----------------------+-------+
| user_id | cutoff | precision | recall | count |
+---------+--------+--------------------+-----------------------+-------+
| 0 | 1 | 1.0 | 0.0072992700729927005 | 137 |
| 0 | 2 | 1.0 | 0.014598540145985401 | 137 |
| 0 | 3 | 1.0 | 0.021897810218978103 | 137 |
| 0 | 4 | 1.0 | 0.029197080291970802 | 137 |
| 0 | 5 | 1.0 | 0.0364963503649635 | 137 |
| 0 | 6 | 1.0 | 0.043795620437956206 | 137 |
| 0 | 7 | 1.0 | 0.051094890510948905 | 137 |
| 0 | 8 | 0.875 | 0.051094890510948905 | 137 |
| 0 | 9 | 0.8888888888888888 | 0.058394160583941604 | 137 |
| 0 | 10 | 0.9 | 0.06569343065693431 | 137 |
+---------+--------+--------------------+-----------------------+-------+
[8262 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'precision_recall_overall': Columns:
cutoff int
precision float
recall float
Rows: 18
Data:
+--------+---------------------+----------------------+
| cutoff | precision | recall |
+--------+---------------------+----------------------+
| 1 | 0.5904139433551199 | 0.026276556353164133 |
| 2 | 0.5555555555555556 | 0.04729886233977543 |
| 3 | 0.530137981118373 | 0.06427629156051506 |
| 4 | 0.505446623093682 | 0.08198374934953354 |
| 5 | 0.49673202614379075 | 0.10107005493937381 |
| 6 | 0.4771241830065358 | 0.11393808710889493 |
| 7 | 0.4593837535014007 | 0.12560420169882278 |
| 8 | 0.44498910675381265 | 0.13513370499162625 |
| 9 | 0.4366981360445414 | 0.1467042228785395 |
| 10 | 0.4270152505446621 | 0.15756201403816322 |
+--------+---------------------+----------------------+
[18 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_by_item': Columns:
movie_id int
rmse float
count int
Rows: 1410
Data:
+----------+--------------------+-------+
| movie_id | rmse | count |
+----------+--------------------+-------+
| 118 | 4.994483330927857 | 1 |
| 660 | 4.051706980833313 | 18 |
| 1236 | 3.0 | 1 |
| 839 | 3.009444466266759 | 14 |
| 699 | 2.3963817593311623 | 4 |
| 567 | 3.1662258020741216 | 52 |
| 773 | 2.8097579253192926 | 8 |
| 1029 | 2.9975773449779246 | 3 |
| 1504 | 4.0 | 1 |
| 435 | 3.5523027952081456 | 25 |
+----------+--------------------+-------+
[1410 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_by_user': Columns:
user_id int
rmse float
count int
Rows: 459
Data:
+---------+--------------------+-------+
| user_id | rmse | count |
+---------+--------------------+-------+
| 118 | 3.98225662310876 | 83 |
| 435 | 3.745207227595973 | 19 |
| 130 | 3.8312217112370925 | 15 |
| 257 | 3.22722724677265 | 15 |
| 217 | 3.353705815288125 | 20 |
| 232 | 3.9452235323993445 | 42 |
| 310 | 3.6932602949820352 | 146 |
| 49 | 3.5550837309018295 | 11 |
| 424 | 3.0400037802448527 | 26 |
| 13 | 3.784632812061756 | 57 |
+---------+--------------------+-------+
[459 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_overall': 3.4523759469176682}
test_pred = ratings_test.assign(rating_pred=y_pred_cosine)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 0.000000 |
| 1 | 0 | 9 | 3 | 875693118 | 0.034503 |
| 2 | 0 | 11 | 5 | 878542960 | 0.620242 |
| 3 | 0 | 13 | 5 | 874965706 | 0.151729 |
| 4 | 0 | 16 | 3 | 875073198 | 0.066452 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 0.013386 |
| 19996 | 457 | 1100 | 4 | 886397931 | 0.040171 |
| 19997 | 458 | 933 | 3 | 879563639 | 0.069467 |
| 19998 | 459 | 9 | 3 | 882912371 | 0.172382 |
| 19999 | 461 | 681 | 5 | 886365231 | 0.400080 |
20000 rows × 5 columns
mprk_cosine = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='TuriCreate Cosine Similarity')
In statistics, the Pearson correlation coefficient, also referred to as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC), or the bivariate correlation, is a measure of linear correlation between two sets of data.
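Pearson's r is equivalent to the cosine similarity of mean-centered vectors, which is why (unlike raw cosine) it discounts a constant rating offset; a minimal NumPy check on toy data:

```python
import numpy as np

def pearson_r(x, y):
    # Covariance normalized by the product of standard deviations,
    # i.e. cosine similarity after subtracting each vector's mean.
    xc, yc = x - x.mean(), y - y.mean()
    return np.dot(xc, yc) / (np.linalg.norm(xc) * np.linalg.norm(yc))

x = np.array([5.0, 3.0, 4.0, 1.0])
y = np.array([4.0, 2.0, 5.0, 1.0])
r_val = pearson_r(x, y)  # matches np.corrcoef(x, y)[0, 1]
```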
item_pearson_sim_model = tc.item_similarity_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating', similarity_type='pearson')
Warning: Ignoring columns unix_timestamp;
To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
Data has 80000 observations with 943 users and 1650 items.
Data prepared in: 0.119556s
Training model from provided data.
Gathering per-item and per-user statistics.
+--------------------------------+------------+
| Elapsed Time (Item Statistics) | % Complete |
+--------------------------------+------------+
| 24.276ms | 100 |
+--------------------------------+------------+
Setting up lookup tables.
Processing data in one pass using dense lookup tables.
+-------------------------------------+------------------+-----------------+
| Elapsed Time (Constructing Lookups) | Total % Complete | Items Processed |
+-------------------------------------+------------------+-----------------+
| 27.686ms | 0 | 2 |
| 371.582ms | 100 | 1650 |
+-------------------------------------+------------------+-----------------+
Finalizing lookup tables.
Generating candidate set for working with new users.
Finished training in 0.405158s
y_pred_pearson = item_pearson_sim_model.predict(test_data)
print(f'MAE: {mae(test_data["rating"], y_pred_pearson)}')
print(f'RMSE: {rmse(np.array(test_data["rating"]), np.array(y_pred_pearson))}')
item_pearson_sim_model.evaluate(test_data)
MAE: 0.8258110855318893
RMSE: 1.0329022402031238

Precision and recall summary statistics by cutoff
+--------+------------------------+------------------------+
| cutoff | mean_precision         | mean_recall            |
+--------+------------------------+------------------------+
| 1      | 0.002178649237472767   | 1.134713144517066e-05  |
| 2      | 0.0010893246187363831  | 1.1347131445170659e-05 |
| 3      | 0.0007262164124909222  | 1.1347131445170659e-05 |
| 4      | 0.0005446623093681916  | 1.1347131445170659e-05 |
| 5      | 0.0004357298474945534  | 1.134713144517066e-05  |
| 6      | 0.00036310820624546115 | 1.134713144517066e-05  |
| 7      | 0.00031123560535325237 | 1.1347131445170659e-05 |
| 8      | 0.0005446623093681914  | 3.357824611326011e-05  |
| 9      | 0.0009682885499878966  | 0.00010851802450734261 |
| 10     | 0.00130718954248366    | 0.00014613572119356663 |
+--------+------------------------+------------------------+
[10 rows x 3 columns]

Overall RMSE: 1.0329022402031236

Per User RMSE (best)
+---------+---------------------+-------+
| user_id | rmse                | count |
+---------+---------------------+-------+
| 458     | 0.09361950462900381 | 1     |
+---------+---------------------+-------+
[1 rows x 3 columns]

Per User RMSE (worst)
+---------+--------------------+-------+
| user_id | rmse               | count |
+---------+--------------------+-------+
| 444     | 1.9771522746600034 | 13    |
+---------+--------------------+-------+
[1 rows x 3 columns]

Per Item RMSE (best)
+----------+------+-------+
| movie_id | rmse | count |
+----------+------+-------+
| 1537     | 0.0  | 1     |
+----------+------+-------+
[1 rows x 3 columns]

Per Item RMSE (worst)
+----------+------+-------+
| movie_id | rmse | count |
+----------+------+-------+
| 1535     | 5.0  | 1     |
+----------+------+-------+
[1 rows x 3 columns]
{'precision_recall_by_user': Columns:
user_id int
cutoff int
precision float
recall float
count int
Rows: 8262
Data:
+---------+--------+-----------+--------+-------+
| user_id | cutoff | precision | recall | count |
+---------+--------+-----------+--------+-------+
| 0 | 1 | 0.0 | 0.0 | 137 |
| 0 | 2 | 0.0 | 0.0 | 137 |
| 0 | 3 | 0.0 | 0.0 | 137 |
| 0 | 4 | 0.0 | 0.0 | 137 |
| 0 | 5 | 0.0 | 0.0 | 137 |
| 0 | 6 | 0.0 | 0.0 | 137 |
| 0 | 7 | 0.0 | 0.0 | 137 |
| 0 | 8 | 0.0 | 0.0 | 137 |
| 0 | 9 | 0.0 | 0.0 | 137 |
| 0 | 10 | 0.0 | 0.0 | 137 |
+---------+--------+-----------+--------+-------+
[8262 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'precision_recall_overall': Columns:
cutoff int
precision float
recall float
Rows: 18
Data:
+--------+------------------------+------------------------+
| cutoff | precision | recall |
+--------+------------------------+------------------------+
| 1 | 0.0021786492374727662 | 1.134713144517066e-05 |
| 2 | 0.0010893246187363831 | 1.1347131445170659e-05 |
| 3 | 0.0007262164124909222 | 1.1347131445170659e-05 |
| 4 | 0.0005446623093681916 | 1.1347131445170659e-05 |
| 5 | 0.0004357298474945535 | 1.1347131445170659e-05 |
| 6 | 0.0003631082062454611 | 1.1347131445170659e-05 |
| 7 | 0.00031123560535325237 | 1.1347131445170659e-05 |
| 8 | 0.0005446623093681916 | 3.357824611326011e-05 |
| 9 | 0.0009682885499878965 | 0.00010851802450734263 |
| 10 | 0.0013071895424836603 | 0.00014613572119356668 |
+--------+------------------------+------------------------+
[18 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_by_item': Columns:
movie_id int
rmse float
count int
Rows: 1410
Data:
+----------+--------------------+-------+
| movie_id | rmse | count |
+----------+--------------------+-------+
| 118 | 0.6670703126020756 | 1 |
| 660 | 0.7662618711933494 | 18 |
| 1236 | 0.6000000000000001 | 1 |
| 839 | 1.1553283313738454 | 14 |
| 699 | 0.9565338331207995 | 4 |
| 567 | 0.9592890906289222 | 52 |
| 773 | 1.429944992377535 | 8 |
| 1029 | 1.8841800437404546 | 3 |
| 1504 | 4.0 | 1 |
| 435 | 0.933689233878814 | 25 |
+----------+--------------------+-------+
[1410 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_by_user': Columns:
user_id int
rmse float
count int
Rows: 459
Data:
+---------+--------------------+-------+
| user_id | rmse | count |
+---------+--------------------+-------+
| 118 | 1.0444937205148837 | 83 |
| 435 | 1.2246197363788032 | 19 |
| 130 | 0.9366325834803617 | 15 |
| 257 | 1.3339270170900215 | 15 |
| 217 | 0.702368780269201 | 20 |
| 232 | 0.9126853053852092 | 42 |
| 310 | 0.8618965991085349 | 146 |
| 49 | 1.2054251005283296 | 11 |
| 424 | 1.3172450920229635 | 26 |
| 13 | 1.068662795110487 | 57 |
+---------+--------------------+-------+
[459 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_overall': 1.0329022402031236}
test_pred = ratings_test.assign(rating_pred=y_pred_pearson)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 3.405687 |
| 1 | 0 | 9 | 3 | 875693118 | 3.877677 |
| 2 | 0 | 11 | 5 | 878542960 | 4.406993 |
| 3 | 0 | 13 | 5 | 874965706 | 3.902111 |
| 4 | 0 | 16 | 3 | 875073198 | 3.179882 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 3.961275 |
| 19996 | 457 | 1100 | 4 | 886397931 | 3.723208 |
| 19997 | 458 | 933 | 3 | 879563639 | 2.906380 |
| 19998 | 459 | 9 | 3 | 882912371 | 3.879852 |
| 19999 | 461 | 681 | 5 | 886365231 | 3.097424 |
20000 rows × 5 columns
mprk_pearson = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='TuriCreate Pearson Correlation')
A recommender that computes similarity from item content rather than from user interaction patterns.
def condense_binary_row(data, final_col, binary_cols, to_str=False):
'''Condenses a df row with binary cols to a list of the column names that are set to 1'''
cols = list(binary_cols)
x = data[cols] # Slice the cols
x = list(x[x[cols] == 1].index) # Get only binary positive values
x.sort()
data[final_col] = x # Assign column names list to final column
data.drop(labels=cols, inplace=True) # Drop binary columns
return data
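A self-contained toy run of the idea, with the function body restated (lightly simplified) so the snippet executes on its own; the row here is invented, and 'Generes' keeps the notebook's own spelling:

```python
import pandas as pd

def condense_binary_row(data, final_col, binary_cols):
    # Restatement of the helper above: collect the names of the 1-valued
    # binary columns into a sorted list and drop the binary columns.
    cols = list(binary_cols)
    x = data[cols]
    data[final_col] = sorted(x[x == 1].index)
    return data.drop(labels=cols)

# dtype=object so a Python list can be stored in a single cell
row = pd.Series({'movie_id': 0, 'year': 1995, 'Action': 1, 'Comedy': 0, 'Drama': 1},
                dtype=object)
out = condense_binary_row(row, 'Generes', ['Action', 'Comedy', 'Drama'])
# out['Generes'] is now ['Action', 'Drama'] and the binary columns are gone
```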
d = pd.merge(pd.Series(ratings_train['movie_id'].unique(), name='movie_id'), movies, on='movie_id')[['movie_id', 'year'] + genre_cols]
content_train = tc.SFrame(d.apply(lambda row: condense_binary_row(row, 'Generes', genre_cols), axis=1))
item_content_model = tc.recommender.item_content_recommender.create(item_data=content_train, item_id='movie_id', observation_data=train_data, user_id='user_id', target='rating')
Applying transform:
Class : AutoVectorizer

Model Fields
------------
Features          : ['year', 'Generes']
Excluded Features : ['movie_id']

Column    Type   Interpretation   Transforms   Output Type
-------   ----   --------------   ----------   -----------
year      int    numerical        None         int
Generes   list   categorical      Flatten      dict

Defaulting to brute force instead of ball tree because there are multiple distance components.
Starting brute force nearest neighbors model training.
Validating distance components.
Initializing model data.
Initializing distances.
Done.
Starting pairwise querying.
+--------------+---------+-------------+--------------+
| Query points | # Pairs | % Complete. | Elapsed Time |
+--------------+---------+-------------+--------------+
| 1 | 1650 | 0.0606061 | 10.802ms |
| Done | | 100 | 337.597ms |
+--------------+---------+-------------+--------------+
Warning: Ignoring columns unix_timestamp;
To use these columns in scoring predictions, use a model that allows the use of additional features.
Preparing data set.
Data has 80000 observations with 943 users and 1650 items.
Data prepared in: 0.289763s
Loading user-provided nearest items.
Generating candidate set for working with new users.
Finished training in 0.043254s
y_pred_content = item_content_model.predict(test_data)
print(f'MAE: {mae(test_data["rating"], y_pred_content)}')
print(f'RMSE: {rmse(np.array(test_data["rating"]), np.array(y_pred_content))}')
item_content_model.evaluate(test_data)
MAE: 3.349401487943707
RMSE: 3.5441401224014406

Precision and recall summary statistics by cutoff
+--------+---------------------+----------------------+
| cutoff | mean_precision      | mean_recall          |
+--------+---------------------+----------------------+
| 1      | 0.1459694989106754  | 0.00399187728549197  |
| 2      | 0.1318082788671024  | 0.007440228130846247 |
| 3      | 0.11328976034858393 | 0.009977820101997957 |
| 4      | 0.11274509803921569 | 0.01307143846911446  |
| 5      | 0.10893246187363846 | 0.015954642182971428 |
| 6      | 0.10784313725490194 | 0.018738603932986587 |
| 7      | 0.10364145658263312 | 0.02079145214278566  |
| 8      | 0.09967320261437909 | 0.02256903121170482  |
| 9      | 0.09803921568627458 | 0.025090184698572437 |
| 10     | 0.09782135076252728 | 0.028056399563741403 |
+--------+---------------------+----------------------+
[10 rows x 3 columns]

Overall RMSE: 3.544140122401443

Per User RMSE (best)
+---------+--------------------+-------+
| user_id | rmse               | count |
+---------+--------------------+-------+
| 180     | 1.6208709661674245 | 217   |
+---------+--------------------+-------+
[1 rows x 3 columns]

Per User RMSE (worst)
+---------+------+-------+
| user_id | rmse | count |
+---------+------+-------+
| 461     | 5.0  | 1     |
+---------+------+-------+
[1 rows x 3 columns]

Per Item RMSE (best)
+----------+--------------------+-------+
| movie_id | rmse               | count |
+----------+--------------------+-------+
| 797      | 0.3231357131781203 | 1     |
+----------+--------------------+-------+
[1 rows x 3 columns]

Per Item RMSE (worst)
+----------+------+-------+
| movie_id | rmse | count |
+----------+------+-------+
| 1151     | 5.0  | 1     |
+----------+------+-------+
[1 rows x 3 columns]
{'precision_recall_by_user': Columns:
user_id int
cutoff int
precision float
recall float
count int
Rows: 8262
Data:
+---------+--------+---------------------+-----------------------+-------+
| user_id | cutoff | precision | recall | count |
+---------+--------+---------------------+-----------------------+-------+
| 0 | 1 | 0.0 | 0.0 | 137 |
| 0 | 2 | 0.0 | 0.0 | 137 |
| 0 | 3 | 0.0 | 0.0 | 137 |
| 0 | 4 | 0.0 | 0.0 | 137 |
| 0 | 5 | 0.0 | 0.0 | 137 |
| 0 | 6 | 0.0 | 0.0 | 137 |
| 0 | 7 | 0.14285714285714285 | 0.0072992700729927005 | 137 |
| 0 | 8 | 0.125 | 0.0072992700729927005 | 137 |
| 0 | 9 | 0.2222222222222222 | 0.014598540145985401 | 137 |
| 0 | 10 | 0.2 | 0.014598540145985401 | 137 |
+---------+--------+---------------------+-----------------------+-------+
[8262 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'precision_recall_overall': Columns:
cutoff int
precision float
recall float
Rows: 18
Data:
+--------+---------------------+-----------------------+
| cutoff | precision | recall |
+--------+---------------------+-----------------------+
| 1 | 0.14596949891067537 | 0.00399187728549197 |
| 2 | 0.1318082788671025 | 0.0074402281308462445 |
| 3 | 0.11328976034858394 | 0.009977820101997957 |
| 4 | 0.1127450980392157 | 0.013071438469114459 |
| 5 | 0.10893246187363843 | 0.015954642182971424 |
| 6 | 0.10784313725490191 | 0.018738603932986583 |
| 7 | 0.1036414565826331 | 0.020791452142785657 |
| 8 | 0.09967320261437909 | 0.022569031211704827 |
| 9 | 0.09803921568627458 | 0.025090184698572434 |
| 10 | 0.09782135076252726 | 0.028056399563741407 |
+--------+---------------------+-----------------------+
[18 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_by_item': Columns:
movie_id int
rmse float
count int
Rows: 1410
Data:
+----------+--------------------+-------+
| movie_id | rmse | count |
+----------+--------------------+-------+
| 118 | 4.954207722746998 | 1 |
| 660 | 4.178508221013939 | 18 |
| 1236 | 3.0 | 1 |
| 839 | 2.9341944676908955 | 14 |
| 699 | 2.1882619595481807 | 4 |
| 567 | 3.6839578265681707 | 52 |
| 773 | 2.4817822848247157 | 8 |
| 1029 | 2.7141173226309756 | 3 |
| 1504 | 4.0 | 1 |
| 435 | 3.3698208460943455 | 25 |
+----------+--------------------+-------+
[1410 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_by_user': Columns:
user_id int
rmse float
count int
Rows: 459
Data:
+---------+--------------------+-------+
| user_id | rmse | count |
+---------+--------------------+-------+
| 118 | 3.9278414196522173 | 83 |
| 435 | 3.7993193292390863 | 19 |
| 130 | 3.881990531457918 | 15 |
| 257 | 3.855888659330392 | 15 |
| 217 | 3.4996199384907856 | 20 |
| 232 | 4.1874992302961 | 42 |
| 310 | 3.785011615194503 | 146 |
| 49 | 3.4319857708539514 | 11 |
| 424 | 3.0591991123939968 | 26 |
| 13 | 3.9775941213135435 | 57 |
+---------+--------------------+-------+
[459 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_overall': 3.544140122401443}
test_pred = ratings_test.assign(rating_pred=y_pred_content)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 0.128994 |
| 1 | 0 | 9 | 3 | 875693118 | 0.042744 |
| 2 | 0 | 11 | 5 | 878542960 | 0.076707 |
| 3 | 0 | 13 | 5 | 874965706 | 0.214417 |
| 4 | 0 | 16 | 3 | 875073198 | 0.016845 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 0.049140 |
| 19996 | 457 | 1100 | 4 | 886397931 | 0.067293 |
| 19997 | 458 | 933 | 3 | 879563639 | 0.000000 |
| 19998 | 459 | 9 | 3 | 882912371 | 0.044776 |
| 19999 | 461 | 681 | 5 | 886365231 | 0.000000 |
20000 rows × 5 columns
mprk_content = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='TuriCreate Item Content')
Matrix factorization is a class of collaborative filtering algorithms used in recommender systems. It works by decomposing the user-item interaction matrix into the product of two lower-dimensional rectangular matrices (user factors and item factors).
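As an illustrative sketch of the idea (not TuriCreate's actual solver), the decomposition can be fit with plain SGD on the observed entries of a toy ratings matrix:

```python
import numpy as np

# Toy ratings matrix; 0 marks an unobserved rating.
rng = np.random.default_rng(0)
R = np.array([[5, 3, 0],
              [4, 0, 1],
              [0, 2, 4]], dtype=float)
num_factors, lr, reg = 2, 0.01, 1e-4
U = 0.1 * rng.standard_normal((R.shape[0], num_factors))  # user factors
V = 0.1 * rng.standard_normal((R.shape[1], num_factors))  # item factors
rows, cols = np.nonzero(R)
for _ in range(4000):
    for i, j in zip(rows, cols):
        err = R[i, j] - U[i] @ V[j]          # error on one observed rating
        U[i] += lr * (err * V[j] - reg * U[i])
        V[j] += lr * (err * U[i] - reg * V[j])
# U @ V.T now approximates R on the observed entries and
# fills in predictions for the unobserved ones.
```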
mf_model = tc.recommender.factorization_recommender.create(train_data, user_id='user_id', item_id='movie_id', target='rating')
Preparing data set.
Data has 80000 observations with 943 users and 1650 items.
Data prepared in: 0.119615s
Training factorization_recommender for recommendations.
+--------------------------------+--------------------------------------------------+----------+
| Parameter | Description | Value |
+--------------------------------+--------------------------------------------------+----------+
| num_factors | Factor Dimension | 8 |
| regularization | L2 Regularization on Factors | 1e-08 |
| solver | Solver used for training | adagrad |
| linear_regularization | L2 Regularization on Linear Coefficients | 1e-10 |
| max_iterations | Maximum Number of Iterations | 50 |
+--------------------------------+--------------------------------------------------+----------+
Optimizing model using SGD; tuning step size.
Using 10000 / 80000 points for tuning the step size.
+---------+-------------------+------------------------------------------+
| Attempt | Initial Step Size | Estimated Objective Value |
+---------+-------------------+------------------------------------------+
| 0 | 16.6667 | Not Viable |
| 1 | 4.16667 | Not Viable |
| 2 | 1.04167 | Not Viable |
| 3 | 0.260417 | 0.173785 |
| 4 | 0.130208 | 0.334914 |
| 5 | 0.0651042 | 0.601517 |
+---------+-------------------+------------------------------------------+
| Final | 0.260417 | 0.173785 |
+---------+-------------------+------------------------------------------+
Starting Optimization.
+---------+--------------+-------------------+-----------------------+-------------+
| Iter. | Elapsed Time | Approx. Objective | Approx. Training RMSE | Step Size |
+---------+--------------+-------------------+-----------------------+-------------+
| Initial | 132us | 1.25115 | 1.11855 | |
+---------+--------------+-------------------+-----------------------+-------------+
| 1 | 104.227ms | DIVERGED | DIVERGED | 0.260417 |
| RESET | 122.977ms | 1.25115 | 1.11855 | |
| 1 | 187.681ms | 1.1799 | 1.08623 | 0.130208 |
| 2 | 243.564ms | 0.908494 | 0.953148 | 0.130208 |
| 3 | 311.02ms | 0.875735 | 0.935806 | 0.130208 |
| 4 | 352.024ms | 0.855887 | 0.92514 | 0.130208 |
| 5 | 417.665ms | 0.839543 | 0.916264 | 0.130208 |
| 9 | 630.25ms | 0.798271 | 0.893458 | 0.130208 |
| 49 | 2.68s | 0.641143 | 0.800705 | 0.130208 |
| 50 | 2.75s | 0.638991 | 0.79936 | 0.130208 |
+---------+--------------+-------------------+-----------------------+-------------+
Optimization Complete: Maximum number of passes through the data reached.
Computing final objective value and training RMSE.
Final objective value: 0.620772
Final training RMSE: 0.787882
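The mae and rmse helpers used in the evaluation below are defined earlier in the notebook; assuming they are straightforward NumPy implementations along these lines:

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error between true and predicted ratings.
    return np.mean(np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)))

def rmse(y_true, y_pred):
    # Root mean squared error between true and predicted ratings.
    return np.sqrt(np.mean((np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)) ** 2))
```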
y_pred_mf = mf_model.predict(test_data)
print(f'MAE: {mae(test_data["rating"], y_pred_mf)}')
print(f'RMSE: {rmse(np.array(test_data["rating"]), np.array(y_pred_mf))}')
mf_model.evaluate(test_data)
MAE: 0.7592660491639456
RMSE: 0.9706173422580765

Precision and recall summary statistics by cutoff
+--------+---------------------+----------------------+
| cutoff | mean_precision      | mean_recall          |
+--------+---------------------+----------------------+
| 1      | 0.16557734204793026 | 0.00526882588907496  |
| 2      | 0.16122004357298475 | 0.009331126392212445 |
| 3      | 0.15831517792302105 | 0.013050216397672225 |
| 4      | 0.15631808278867096 | 0.01656749454074667  |
| 5      | 0.14901960784313723 | 0.019136219428219754 |
| 6      | 0.1427015250544662  | 0.021669656407745114 |
| 7      | 0.1400560224089636  | 0.024627216985494017 |
| 8      | 0.13671023965141602 | 0.026788140844956013 |
| 9      | 0.1353183248608085  | 0.0294314764760266   |
| 10     | 0.13464052287581704 | 0.032400857433660626 |
+--------+---------------------+----------------------+
[10 rows x 3 columns]

Overall RMSE: 0.9706173422580769

Per User RMSE (best)
+---------+---------------------+-------+
| user_id | rmse                | count |
+---------+---------------------+-------+
| 458     | 0.06960571643429248 | 1     |
+---------+---------------------+-------+
[1 rows x 3 columns]

Per User RMSE (worst)
+---------+--------------------+-------+
| user_id | rmse               | count |
+---------+--------------------+-------+
| 35      | 1.8923654297437629 | 11    |
+---------+--------------------+-------+
[1 rows x 3 columns]

Per Item RMSE (best)
+----------+---------------------+-------+
| movie_id | rmse                | count |
+----------+---------------------+-------+
| 1235     | 0.01691340597193358 | 1     |
+----------+---------------------+-------+
[1 rows x 3 columns]

Per Item RMSE (worst)
+----------+--------------------+-------+
| movie_id | rmse               | count |
+----------+--------------------+-------+
| 912      | 3.5508890322522575 | 1     |
+----------+--------------------+-------+
[1 rows x 3 columns]
{'precision_recall_by_user': Columns:
user_id int
cutoff int
precision float
recall float
count int
Rows: 8262
Data:
+---------+--------+-----------+--------+-------+
| user_id | cutoff | precision | recall | count |
+---------+--------+-----------+--------+-------+
| 0 | 1 | 0.0 | 0.0 | 137 |
| 0 | 2 | 0.0 | 0.0 | 137 |
| 0 | 3 | 0.0 | 0.0 | 137 |
| 0 | 4 | 0.0 | 0.0 | 137 |
| 0 | 5 | 0.0 | 0.0 | 137 |
| 0 | 6 | 0.0 | 0.0 | 137 |
| 0 | 7 | 0.0 | 0.0 | 137 |
| 0 | 8 | 0.0 | 0.0 | 137 |
| 0 | 9 | 0.0 | 0.0 | 137 |
| 0 | 10 | 0.0 | 0.0 | 137 |
+---------+--------+-----------+--------+-------+
[8262 rows x 5 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'precision_recall_overall': Columns:
cutoff int
precision float
recall float
Rows: 18
Data:
+--------+---------------------+----------------------+
| cutoff | precision | recall |
+--------+---------------------+----------------------+
| 1 | 0.1655773420479303 | 0.005268825889074959 |
| 2 | 0.16122004357298472 | 0.009331126392212445 |
| 3 | 0.1583151779230211 | 0.013050216397672225 |
| 4 | 0.15631808278867101 | 0.016567494540746665 |
| 5 | 0.14901960784313717 | 0.01913621942821975 |
| 6 | 0.14270152505446626 | 0.021669656407745114 |
| 7 | 0.14005602240896364 | 0.02462721698549402 |
| 8 | 0.1367102396514161 | 0.026788140844956006 |
| 9 | 0.1353183248608085 | 0.0294314764760266 |
| 10 | 0.13464052287581701 | 0.03240085743366061 |
+--------+---------------------+----------------------+
[18 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_by_item': Columns:
movie_id int
rmse float
count int
Rows: 1410
Data:
+----------+---------------------+-------+
| movie_id | rmse | count |
+----------+---------------------+-------+
| 118 | 0.7910478192527073 | 1 |
| 660 | 0.5890040053092241 | 18 |
| 1236 | 1.5275962492820572 | 1 |
| 839 | 1.0178424368185373 | 14 |
| 699 | 1.2620302367835743 | 4 |
| 567 | 0.8120727775658789 | 52 |
| 773 | 1.1272444246543243 | 8 |
| 1029 | 1.6125504979801046 | 3 |
| 1504 | 0.17916466691770827 | 1 |
| 435 | 0.9238922028472576 | 25 |
+----------+---------------------+-------+
[1410 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_by_user': Columns:
user_id int
rmse float
count int
Rows: 459
Data:
+---------+--------------------+-------+
| user_id | rmse | count |
+---------+--------------------+-------+
| 118 | 0.958002504778921 | 83 |
| 435 | 1.2969352371657072 | 19 |
| 130 | 0.9306407250597798 | 15 |
| 257 | 1.2873917228646043 | 15 |
| 217 | 0.6888458397922261 | 20 |
| 232 | 0.7228751539558613 | 42 |
| 310 | 0.8527070090615276 | 146 |
| 49 | 1.2664917913911427 | 11 |
| 424 | 1.1889360649909475 | 26 |
| 13 | 1.077035553861223 | 57 |
+---------+--------------------+-------+
[459 rows x 3 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
'rmse_overall': 0.9706173422580769}
test_pred = ratings_test.assign(rating_pred=y_pred_mf)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 3.064887 |
| 1 | 0 | 9 | 3 | 875693118 | 4.112834 |
| 2 | 0 | 11 | 5 | 878542960 | 4.530835 |
| 3 | 0 | 13 | 5 | 874965706 | 4.147889 |
| 4 | 0 | 16 | 3 | 875073198 | 3.788614 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 3.962348 |
| 19996 | 457 | 1100 | 4 | 886397931 | 3.644977 |
| 19997 | 458 | 933 | 3 | 879563639 | 2.930394 |
| 19998 | 459 | 9 | 3 | 882912371 | 4.017355 |
| 19999 | 461 | 681 | 5 | 886365231 | 3.592513 |
20000 rows × 5 columns
mprk_mf = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='TuriCreate Matrix Factorization')
plot_model_loss_comparison(losses={'TuriCreate Cosine Similarity': {'mae': mae(test_data["rating"], y_pred_cosine), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_cosine))},
'TuriCreate Pearson Correlation': {'mae': mae(test_data["rating"], y_pred_pearson), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_pearson))},
'TuriCreate Item Content': {'mae': mae(test_data["rating"], y_pred_content), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_content))},
'TuriCreate Matrix Factorization': {'mae': mae(test_data["rating"], y_pred_mf), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_mf))},
},
plot_title='TuriCreate models comparison', high=True)
plot_model_precision_recall_at_k_comparison(data={'TuriCreate Cosine Similarity': mprk_cosine,
'TuriCreate Pearson Correlation': mprk_pearson,
'TuriCreate Item Content': mprk_content,
'TuriCreate Matrix Factorization': mprk_mf},
plot_title='TuriCreate')
models_performance = tc.recommender.util.compare_models(test_data, [item_content_model, item_cosine_sim_model, item_pearson_sim_model, mf_model])
print(models_performance)
(Columns:
	user_id	int
	movie_id	int
	rating	int
	unix_timestamp	int

Rows: 20000

Data:
+---------+----------+--------+----------------+
| user_id | movie_id | rating | unix_timestamp |
+---------+----------+--------+----------------+
| 0       | 5        | 5      | 887431973      |
| 0       | 9        | 3      | 875693118      |
| 0       | 11       | 5      | 878542960      |
| 0       | 13       | 5      | 874965706      |
| 0       | 16       | 3      | 875073198      |
| 0       | 19       | 4      | 887431883      |
| 0       | 22       | 4      | 875072895      |
| 0       | 23       | 3      | 875071713     |
| 0       | 26       | 2      | 876892946      |
| 0       | 30       | 3      | 875072144      |
+---------+----------+--------+----------------+
[20000 rows x 4 columns]
Note: Only the head of the SFrame is printed.
You can use print_rows(num_rows=m, num_columns=n) to print more rows and columns.,
[Class : ItemContentRecommender

Schema
------
User ID : user_id
Item ID : movie_id
Target : rating
Additional observation features : 0
User side features : []
Item side features : ['movie_id', 'year', 'Generes']

Statistics
----------
Number of observations : 80000
Number of users : 943
Number of items : 1650

Training summary
----------------
Training time : 0.0433

Model Parameters
----------------
Model class : ItemContentRecommender
threshold : 0.0001
similarity_type : cosine
training_method : auto

Other Settings
--------------
degree_approximation_threshold : 4096
max_data_passes : 4096
max_item_neighborhood_size : 64
nearest_neighbors_interaction_proportion_threshold : 0.05
seed_item_set_size : 50
sparse_density_estimation_sample_size : 4096
target_memory_usage : 8589934592
,
Class : ItemSimilarityRecommender

Schema
------
User ID : user_id
Item ID : movie_id
Target : rating
Additional observation features : 0
User side features : []
Item side features : []

Statistics
----------
Number of observations : 80000
Number of users : 943
Number of items : 1650

Training summary
----------------
Training time : 0.2958

Model Parameters
----------------
Model class : ItemSimilarityRecommender
threshold : 0.001
similarity_type : cosine
training_method : auto

Other Settings
--------------
degree_approximation_threshold : 4096
max_data_passes : 4096
max_item_neighborhood_size : 64
nearest_neighbors_interaction_proportion_threshold : 0.05
seed_item_set_size : 50
sparse_density_estimation_sample_size : 4096
target_memory_usage : 8589934592
,
Class : ItemSimilarityRecommender

Schema
------
User ID : user_id
Item ID : movie_id
Target : rating
Additional observation features : 0
User side features : []
Item side features : []

Statistics
----------
Number of observations : 80000
Number of users : 943
Number of items : 1650

Training summary
----------------
Training time : 0.4052

Model Parameters
----------------
Model class : ItemSimilarityRecommender
threshold : 0.001
similarity_type : pearson
training_method : auto

Other Settings
--------------
degree_approximation_threshold : 4096
max_data_passes : 4096
max_item_neighborhood_size : 64
nearest_neighbors_interaction_proportion_threshold : 0.05
seed_item_set_size : 50
sparse_density_estimation_sample_size : 4096
target_memory_usage : 8589934592
,
Class : FactorizationRecommender

Schema
------
User ID : user_id
Item ID : movie_id
Target : rating
Additional observation features : 1
User side features : []
Item side features : []

Statistics
----------
Number of observations : 80000
Number of users : 943
Number of items : 1650

Training summary
----------------
Training time : 3.2615

Model Parameters
----------------
Model class : FactorizationRecommender
num_factors : 8
binary_target : 0
side_data_factorization : 1
solver : auto
nmf : 0
max_iterations : 50

Regularization Settings
-----------------------
regularization : 0.0
regularization_type : normal
linear_regularization : 0.0

Optimization Settings
---------------------
init_random_sigma : 0.01
sgd_convergence_interval : 4
sgd_convergence_threshold : 0.0
sgd_max_trial_iterations : 5
sgd_sampling_block_size : 131072
sgd_step_adjustment_interval : 4
sgd_step_size : 0.0
sgd_trial_sample_minimum_size : 10000
sgd_trial_sample_proportion : 0.125
step_size_decrease_rate : 0.75
additional_iterations_if_unhealthy : 5
adagrad_momentum_weighting : 0.9
num_tempering_iterations : 4
tempering_regularization_start_value : 0.0
track_exact_loss : 0
])
plot_model_loss_comparison(losses={'Baseline All': {'mae': basemodel_loss["MAE"], 'rmse': basemodel_loss["RMSE"]},
'TuriCreate Cosine Similarity': {'mae': mae(test_data["rating"], y_pred_cosine), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_cosine))},
'TuriCreate Pearson Correlation': {'mae': mae(test_data["rating"], y_pred_pearson), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_pearson))},
'TuriCreate Item Content': {'mae': mae(test_data["rating"], y_pred_content), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_content))},
'TuriCreate Matrix Factorization': {'mae': mae(test_data["rating"], y_pred_mf), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_mf))},
},
plot_title='TuriCreate models comparison', high=True)
NCF (Neural Collaborative Filtering) has two components: GMF (Generalized Matrix Factorization) and an MLP (Multi-Layer Perceptron). GMF applies a linear kernel to model user-item interactions, as in matrix factorization, while the MLP stacks multiple neural layers to capture nonlinear interactions.
NCF combines the two to superimpose their desirable characteristics: it concatenates the outputs of GMF and the MLP before feeding them into the final NCF layer.
MODEL_WEIGHTS_FILE_PREFIX = 'keras_model_'
KERAS_MODEL_PATH = 'keras_models'
KERAS_MODEL_FULL_PATH = join(getcwd(), KERAS_MODEL_PATH)
if not isdir(KERAS_MODEL_FULL_PATH):
mkdir(KERAS_MODEL_FULL_PATH)
def root_mean_squared_error(y_true, y_pred):
return K.sqrt(K.mean(K.square(y_pred - y_true)))
def min_max_normalize(value, min_val, max_val):
return (value - min_val) / (max_val - min_val)
def reverse_min_max_normalize(value, min_val, max_val):
return value * (max_val - min_val) + min_val
def get_min_max_rating(df):
min_rating = df['rating'].astype(np.float32).min()
max_rating = df['rating'].astype(np.float32).max()
return max_rating, min_rating
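A quick round-trip sanity check of the normalization helpers (restated here so the snippet is self-contained; the 1-5 bounds assume the MovieLens rating scale):

```python
def min_max_normalize(value, min_val, max_val):
    # Map value from [min_val, max_val] into [0, 1].
    return (value - min_val) / (max_val - min_val)

def reverse_min_max_normalize(value, min_val, max_val):
    # Map value from [0, 1] back into [min_val, max_val].
    return value * (max_val - min_val) + min_val

# A rating in [1, 5] maps into [0, 1] and back without loss.
min_rating, max_rating = 1.0, 5.0
for rating in (1.0, 2.5, 4.0, 5.0):
    x = min_max_normalize(rating, min_rating, max_rating)
    assert 0.0 <= x <= 1.0
    assert reverse_min_max_normalize(x, min_rating, max_rating) == rating
```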
Build input maps of user ids and movie ids for the train and test splits.
max_rating, min_rating = get_min_max_rating(ratings)
train = ratings_train[['user_id', 'movie_id']].values
X_train_map = {'user_id': train[:, 0], 'movie_id': train[:, 1]}
y_train = ratings_train['rating'].values.astype(np.float32)
test = ratings_test[['user_id', 'movie_id']].values
X_test_map = {'user_id': test[:, 0], 'movie_id': test[:, 1]}
y_test = ratings_test['rating'].values.astype(np.float32)
print(f'Train data shapes: user_id: {X_train_map["user_id"].shape}, movie_id: {X_train_map["movie_id"].shape}, y train (rating): {y_train.shape}')
print(f'Test data shapes: user_id: {X_test_map["user_id"].shape}, movie_id: {X_test_map["movie_id"].shape}, y test (rating): {y_test.shape}')
Train data shapes: user_id: (80000,), movie_id: (80000,), y train (rating): (80000,)
Test data shapes: user_id: (20000,), movie_id: (20000,), y test (rating): (20000,)
def plot_history(history, plot_title):
metric_map = {'mae': 'MAE', 'root_mean_squared_error': 'RMSE'}
fig, axes = plt.subplots(nrows=1, ncols=len(metric_map), figsize=(len(metric_map) * 7, 5))
history = history.history
# Plot training & validation metrics
for i, metric in enumerate(metric_map.keys()):
axes[i].plot(history[metric])
axes[i].plot(history[f'val_{metric}'])
axes[i].legend(['Train', 'Validation'], loc='upper right')
axes[i].set_title(f'Model {metric_map[metric]} Metric')
axes[i].set_ylabel(metric_map[metric])
axes[i].set_xlabel('Epoch')
axes[i].grid()
fig.suptitle(plot_title + " Metrics History")
plt.tight_layout(rect=[0, 0.03, 1, 0.95])
plt.show()
class TrainTime(Callback):
'''Used to print elapsed time in seconds in the end of the training process'''
def on_train_begin(self, logs={}):
self.train_time_start = time.time()
def on_train_end(self, logs={}):
self.train_time_end = time.time() - self.train_time_start
print(f'Total training time: {self.train_time_end} sec')
def create_callbacks(model_name):
train_time_callback = TrainTime()
return train_time_callback, [train_time_callback, EarlyStopping('val_loss', patience=10),
ModelCheckpoint(join(KERAS_MODEL_FULL_PATH, MODEL_WEIGHTS_FILE_PREFIX + model_name + '.h5'), save_best_only=True)]
We will implement an NCF (Neural Collaborative Filtering) model starting with 1 hidden dense layer.
def create_ncf_model_1(num_users, num_movies, latent_dim, hidden_dim, loss_f, lr, dropout):
user_input = Input(shape=(1,), dtype='int32', name='user_input')
movie_input = Input(shape=(1,),dtype='int32', name='movie_input')
# Matrix factorization embedding layer
mf_embedding_user = Embedding(input_dim=num_users, output_dim=latent_dim, name='mf_user_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))
mf_embedding_movie = Embedding(input_dim=num_movies, output_dim=latent_dim, name='mf_movie_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))
mf_user_latent = Flatten(name='mf_user_latent')(mf_embedding_user(user_input))
mf_movie_latent = Flatten(name='mf_movie_latent')(mf_embedding_movie(movie_input))
# MLP embedding layer
mlp_embedding_user = Embedding(input_dim=num_users, output_dim=latent_dim, name='mlp_user_embedding', input_length=1)
mlp_embedding_movie = Embedding(input_dim=num_movies, output_dim=latent_dim, name='mlp_movie_embedding', input_length=1)
mlp_user_latent = Flatten(name='mlp_user_latent')(mlp_embedding_user(user_input))
mlp_movie_latent = Flatten(name='mlp_movie_latent')(mlp_embedding_movie(movie_input))
# Element-wise product of user and movie embeddings
gmf = Multiply(name='gmf_user_movie_element_wise_product')([mf_user_latent, mf_movie_latent])
# MLP concatenation
mlp = Concatenate(name='mlp_concat')([mlp_user_latent, mlp_movie_latent])
# MLP hidden layers
drop1 = Dropout(0.1, name='drop1_0.1')(mlp)
hidden1 = Dense(hidden_dim, activation='relu')(drop1)
drop2 = Dropout(dropout, name=f'drop2_{dropout}')(hidden1)
# GMF & MLP concatenation
ncf_concat = Concatenate(name='ncf_concat')([gmf, drop2])
x = Dense(1, activation='sigmoid', kernel_initializer='he_normal', name='prediction')(ncf_concat)
# The sigmoid output lies in [0, 1], so we reverse-normalize it back to the rating range [min_rating, max_rating]
pred = Lambda(lambda value: reverse_min_max_normalize(value, min_rating, max_rating), name='min_max_normalize')(x)
model = Model(inputs=[user_input, movie_input], outputs=pred)
model.compile(loss=loss_f, optimizer=Adam(lr=lr), metrics=['mae', root_mean_squared_error])
return model
ncf_model_1 = create_ncf_model_1(num_users=rating_user_count, num_movies=rated_movie_count, latent_dim=20, hidden_dim=20, loss_f='mse', lr=0.001, dropout=0.5)
ncf_model_1.summary()
Model: "model_29"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
user_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
movie_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
mf_user_embedding (Embedding) (None, 1, 20) 18860 user_input[0][0]
user_input[0][0]
__________________________________________________________________________________________________
mf_movie_embedding (Embedding) (None, 1, 20) 33640 movie_input[0][0]
movie_input[0][0]
__________________________________________________________________________________________________
mlp_user_latent (Flatten) (None, 20) 0 mf_user_embedding[1][0]
__________________________________________________________________________________________________
mlp_movie_latent (Flatten) (None, 20) 0 mf_movie_embedding[1][0]
__________________________________________________________________________________________________
mlp_concat (Concatenate) (None, 40) 0 mlp_user_latent[0][0]
mlp_movie_latent[0][0]
__________________________________________________________________________________________________
drop1_0.05 (Dropout) (None, 40) 0 mlp_concat[0][0]
__________________________________________________________________________________________________
mf_user_latent (Flatten) (None, 20) 0 mf_user_embedding[0][0]
__________________________________________________________________________________________________
mf_movie_latent (Flatten) (None, 20) 0 mf_movie_embedding[0][0]
__________________________________________________________________________________________________
dense_37 (Dense) (None, 20) 820 drop1_0.05[0][0]
__________________________________________________________________________________________________
gmf_user_movie_element_wise_pro (None, 20) 0 mf_user_latent[0][0]
mf_movie_latent[0][0]
__________________________________________________________________________________________________
drop2_0.5 (Dropout) (None, 20) 0 dense_37[0][0]
__________________________________________________________________________________________________
ncf_concat (Concatenate) (None, 40) 0 gmf_user_movie_element_wise_produ
drop2_0.5[0][0]
__________________________________________________________________________________________________
prediction (Dense) (None, 1) 41 ncf_concat[0][0]
__________________________________________________________________________________________________
min_max_normalize (Lambda) (None, 1) 0 prediction[0][0]
==================================================================================================
Total params: 53,361
Trainable params: 53,361
Non-trainable params: 0
__________________________________________________________________________________________________
plot_model(ncf_model_1, "ncf_model_1.png", show_shapes=True)
train_time_callback_1, callbacks = create_callbacks('ncf_model_1')
ncf_model_1_history = ncf_model_1.fit([X_train_map['user_id'], X_train_map['movie_id']], y_train, epochs=25, validation_split=.1, verbose=1, callbacks=callbacks, batch_size=32)
Epoch 1/25
2250/2250 [==============================] - 7s 3ms/step - loss: 1.1276 - mae: 0.8597 - root_mean_squared_error: 1.0524 - val_loss: 0.9760 - val_mae: 0.7931 - val_root_mean_squared_error: 0.9657
Epoch 2/25
2250/2250 [==============================] - 6s 2ms/step - loss: 0.9099 - mae: 0.7581 - root_mean_squared_error: 0.9455 - val_loss: 0.9730 - val_mae: 0.7920 - val_root_mean_squared_error: 0.9637
Epoch 3/25
2250/2250 [==============================] - 6s 3ms/step - loss: 0.8818 - mae: 0.7443 - root_mean_squared_error: 0.9312 - val_loss: 0.9715 - val_mae: 0.7918 - val_root_mean_squared_error: 0.9627
Epoch 4/25
2250/2250 [==============================] - 6s 3ms/step - loss: 0.8607 - mae: 0.7362 - root_mean_squared_error: 0.9190 - val_loss: 0.9830 - val_mae: 0.8000 - val_root_mean_squared_error: 0.9685
Epoch 5/25
2250/2250 [==============================] - 6s 3ms/step - loss: 0.8374 - mae: 0.7255 - root_mean_squared_error: 0.9062 - val_loss: 0.9773 - val_mae: 0.7956 - val_root_mean_squared_error: 0.9657
Epoch 6/25
2250/2250 [==============================] - 6s 3ms/step - loss: 0.8253 - mae: 0.7186 - root_mean_squared_error: 0.9000 - val_loss: 0.9850 - val_mae: 0.8031 - val_root_mean_squared_error: 0.9704
Epoch 7/25
2250/2250 [==============================] - 6s 2ms/step - loss: 0.8030 - mae: 0.7075 - root_mean_squared_error: 0.8876 - val_loss: 0.9770 - val_mae: 0.7998 - val_root_mean_squared_error: 0.9666
Epoch 8/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7642 - mae: 0.6886 - root_mean_squared_error: 0.8663 - val_loss: 0.9979 - val_mae: 0.8168 - val_root_mean_squared_error: 0.9783
Epoch 9/25
2250/2250 [==============================] - 6s 3ms/step - loss: 0.7081 - mae: 0.6619 - root_mean_squared_error: 0.8327 - val_loss: 1.0020 - val_mae: 0.8216 - val_root_mean_squared_error: 0.9808
Epoch 10/25
2250/2250 [==============================] - 6s 2ms/step - loss: 0.6579 - mae: 0.6365 - root_mean_squared_error: 0.8029 - val_loss: 1.0172 - val_mae: 0.8312 - val_root_mean_squared_error: 0.9889
Epoch 11/25
2250/2250 [==============================] - 6s 2ms/step - loss: 0.6067 - mae: 0.6085 - root_mean_squared_error: 0.7694 - val_loss: 1.0580 - val_mae: 0.8583 - val_root_mean_squared_error: 1.0097
Epoch 12/25
2250/2250 [==============================] - 6s 3ms/step - loss: 0.5592 - mae: 0.5827 - root_mean_squared_error: 0.7394 - val_loss: 1.0565 - val_mae: 0.8582 - val_root_mean_squared_error: 1.0089
Epoch 13/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.5123 - mae: 0.5564 - root_mean_squared_error: 0.7074 - val_loss: 1.0874 - val_mae: 0.8739 - val_root_mean_squared_error: 1.0239
Total training time: 75.4180154800415 sec
plot_history(ncf_model_1_history, plot_title='NCF V1 model')
We can clearly see that our initial Neural Collaborative Filtering model suffers from overfitting:
the validation loss keeps increasing while the training loss decreases.
In our second iteration we will try to address this by switching to an MAE loss and lowering the learning rate to 0.0001.
loss_metrics_ncf_1 = ncf_model_1.evaluate([X_test_map['user_id'], X_test_map['movie_id']], y_test)
print(f'MAE: {loss_metrics_ncf_1[1]}')
print(f'RMSE: {loss_metrics_ncf_1[2]}')
625/625 [==============================] - 1s 1ms/step - loss: 1.0128 - mae: 0.7796 - root_mean_squared_error: 0.9822
MAE: 0.7796024680137634
RMSE: 0.9821773171424866
prediction = ncf_model_1.predict([X_test_map['user_id'], X_test_map['movie_id']], batch_size=20)
test_pred = ratings_test.assign(rating_pred=prediction)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 3.800531 |
| 1 | 0 | 9 | 3 | 875693118 | 4.134938 |
| 2 | 0 | 11 | 5 | 878542960 | 4.930750 |
| 3 | 0 | 13 | 5 | 874965706 | 4.294693 |
| 4 | 0 | 16 | 3 | 875073198 | 3.214992 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 3.928123 |
| 19996 | 457 | 1100 | 4 | 886397931 | 3.894881 |
| 19997 | 458 | 933 | 3 | 879563639 | 3.131896 |
| 19998 | 459 | 9 | 3 | 882912371 | 3.934884 |
| 19999 | 461 | 681 | 5 | 886365231 | 3.312722 |
20000 rows × 5 columns
mprk_ncf_1 = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='NCF Model V1')
ncf_model_2 = create_ncf_model_1(num_users=rating_user_count, num_movies=rated_movie_count, latent_dim=20, hidden_dim=20, loss_f='mae', lr=0.0001, dropout=0.5)
ncf_model_2.summary()
Model: "model_30"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
user_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
movie_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
mf_user_embedding (Embedding) (None, 1, 20) 18860 user_input[0][0]
user_input[0][0]
__________________________________________________________________________________________________
mf_movie_embedding (Embedding) (None, 1, 20) 33640 movie_input[0][0]
movie_input[0][0]
__________________________________________________________________________________________________
mlp_user_latent (Flatten) (None, 20) 0 mf_user_embedding[1][0]
__________________________________________________________________________________________________
mlp_movie_latent (Flatten) (None, 20) 0 mf_movie_embedding[1][0]
__________________________________________________________________________________________________
mlp_concat (Concatenate) (None, 40) 0 mlp_user_latent[0][0]
mlp_movie_latent[0][0]
__________________________________________________________________________________________________
drop1_0.05 (Dropout) (None, 40) 0 mlp_concat[0][0]
__________________________________________________________________________________________________
mf_user_latent (Flatten) (None, 20) 0 mf_user_embedding[0][0]
__________________________________________________________________________________________________
mf_movie_latent (Flatten) (None, 20) 0 mf_movie_embedding[0][0]
__________________________________________________________________________________________________
dense_38 (Dense) (None, 20) 820 drop1_0.05[0][0]
__________________________________________________________________________________________________
gmf_user_movie_element_wise_pro (None, 20) 0 mf_user_latent[0][0]
mf_movie_latent[0][0]
__________________________________________________________________________________________________
drop2_0.5 (Dropout) (None, 20) 0 dense_38[0][0]
__________________________________________________________________________________________________
ncf_concat (Concatenate) (None, 40) 0 gmf_user_movie_element_wise_produ
drop2_0.5[0][0]
__________________________________________________________________________________________________
prediction (Dense) (None, 1) 41 ncf_concat[0][0]
__________________________________________________________________________________________________
min_max_normalize (Lambda) (None, 1) 0 prediction[0][0]
==================================================================================================
Total params: 53,361
Trainable params: 53,361
Non-trainable params: 0
__________________________________________________________________________________________________
train_time_callback_2, callbacks = create_callbacks('ncf_model_2')
ncf_model_2_history = ncf_model_2.fit([X_train_map['user_id'], X_train_map['movie_id']], y_train, epochs=25, validation_split=.1, verbose=1, callbacks=callbacks, batch_size=32)
Epoch 1/25
2250/2250 [==============================] - 7s 3ms/step - loss: 0.9810 - mae: 0.9809 - root_mean_squared_error: 1.1900 - val_loss: 0.8745 - val_mae: 0.8745 - val_root_mean_squared_error: 1.0243
Epoch 2/25
2250/2250 [==============================] - 6s 2ms/step - loss: 0.8459 - mae: 0.8458 - root_mean_squared_error: 1.0397 - val_loss: 0.7930 - val_mae: 0.7929 - val_root_mean_squared_error: 0.9753
Epoch 3/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7802 - mae: 0.7801 - root_mean_squared_error: 0.9828 - val_loss: 0.7828 - val_mae: 0.7827 - val_root_mean_squared_error: 0.9673
Epoch 4/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7584 - mae: 0.7583 - root_mean_squared_error: 0.9615 - val_loss: 0.7799 - val_mae: 0.7798 - val_root_mean_squared_error: 0.9658
Epoch 5/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7539 - mae: 0.7538 - root_mean_squared_error: 0.9568 - val_loss: 0.7784 - val_mae: 0.7783 - val_root_mean_squared_error: 0.9641
Epoch 6/25
2250/2250 [==============================] - 6s 2ms/step - loss: 0.7475 - mae: 0.7475 - root_mean_squared_error: 0.9520 - val_loss: 0.7772 - val_mae: 0.7771 - val_root_mean_squared_error: 0.9634
Epoch 7/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7447 - mae: 0.7446 - root_mean_squared_error: 0.9513 - val_loss: 0.7776 - val_mae: 0.7775 - val_root_mean_squared_error: 0.9632
Epoch 8/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7332 - mae: 0.7331 - root_mean_squared_error: 0.9395 - val_loss: 0.7780 - val_mae: 0.7779 - val_root_mean_squared_error: 0.9634
Epoch 9/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7328 - mae: 0.7327 - root_mean_squared_error: 0.9411 - val_loss: 0.7792 - val_mae: 0.7791 - val_root_mean_squared_error: 0.9638
Epoch 10/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7277 - mae: 0.7276 - root_mean_squared_error: 0.9373 - val_loss: 0.7794 - val_mae: 0.7793 - val_root_mean_squared_error: 0.9652
Epoch 11/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7241 - mae: 0.7240 - root_mean_squared_error: 0.9331 - val_loss: 0.7797 - val_mae: 0.7796 - val_root_mean_squared_error: 0.9654
Epoch 12/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7243 - mae: 0.7241 - root_mean_squared_error: 0.9363 - val_loss: 0.7797 - val_mae: 0.7796 - val_root_mean_squared_error: 0.9660
Epoch 13/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7236 - mae: 0.7235 - root_mean_squared_error: 0.9393 - val_loss: 0.7804 - val_mae: 0.7803 - val_root_mean_squared_error: 0.9669
Epoch 14/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7193 - mae: 0.7192 - root_mean_squared_error: 0.9312 - val_loss: 0.7810 - val_mae: 0.7809 - val_root_mean_squared_error: 0.9682
Epoch 15/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7170 - mae: 0.7169 - root_mean_squared_error: 0.9309 - val_loss: 0.7814 - val_mae: 0.7813 - val_root_mean_squared_error: 0.9688
Epoch 16/25
2250/2250 [==============================] - 5s 2ms/step - loss: 0.7176 - mae: 0.7175 - root_mean_squared_error: 0.9317 - val_loss: 0.7827 - val_mae: 0.7826 - val_root_mean_squared_error: 0.9703
Total training time: 86.43772721290588 sec
plot_history(ncf_model_2_history, plot_title='NCF V2 model')
We can already see an improvement: the model suffers less from overfitting.
loss_metrics_ncf_2 = ncf_model_2.evaluate([X_test_map['user_id'], X_test_map['movie_id']], y_test)
print(f'MAE: {loss_metrics_ncf_2[1]}')
print(f'RMSE: {loss_metrics_ncf_2[2]}')
625/625 [==============================] - 1s 1ms/step - loss: 0.7395 - mae: 0.7394 - root_mean_squared_error: 0.9469
MAE: 0.7393589019775391
RMSE: 0.9469417929649353
prediction = ncf_model_2.predict([X_test_map['user_id'], X_test_map['movie_id']], batch_size=20)
test_pred = ratings_test.assign(rating_pred=prediction)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 3.823537 |
| 1 | 0 | 9 | 3 | 875693118 | 4.243023 |
| 2 | 0 | 11 | 5 | 878542960 | 4.879962 |
| 3 | 0 | 13 | 5 | 874965706 | 4.183180 |
| 4 | 0 | 16 | 3 | 875073198 | 3.458555 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 4.025693 |
| 19996 | 457 | 1100 | 4 | 886397931 | 3.967652 |
| 19997 | 458 | 933 | 3 | 879563639 | 2.956453 |
| 19998 | 459 | 9 | 3 | 882912371 | 3.460083 |
| 19999 | 461 | 681 | 5 | 886365231 | 3.908199 |
20000 rows × 5 columns
mprk_ncf_2 = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='NCF Model V2')
We will now increase the number of hidden layers from 1 to 3 (each 20 neurons wide), with dropout between the layers, to increase the learning capacity of the MLP side of our NCF model.
def create_ncf_model_2(num_users, num_movies, latent_dim, hidden_dim, loss_f, lr, dropout):
user_input = Input(shape=(1,), dtype='int32', name='user_input')
movie_input = Input(shape=(1,),dtype='int32', name='movie_input')
# Matrix factorization embedding layer
mf_embedding_user = Embedding(input_dim=num_users, output_dim=latent_dim, name='mf_user_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))
mf_embedding_movie = Embedding(input_dim=num_movies, output_dim=latent_dim, name='mf_movie_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))
mf_user_latent = Flatten(name='mf_user_latent')(mf_embedding_user(user_input))
mf_movie_latent = Flatten(name='mf_movie_latent')(mf_embedding_movie(movie_input))
# MLP embedding layer
mlp_embedding_user = Embedding(input_dim=num_users, output_dim=latent_dim, name='mlp_user_embedding', input_length=1)
mlp_embedding_movie = Embedding(input_dim=num_movies, output_dim=latent_dim, name='mlp_movie_embedding', input_length=1)
mlp_user_latent = Flatten(name='mlp_user_latent')(mlp_embedding_user(user_input))
mlp_movie_latent = Flatten(name='mlp_movie_latent')(mlp_embedding_movie(movie_input))
# Element-wise product of user and movie embeddings
gmf = Multiply(name='gmf_user_movie_element_wise_product')([mf_user_latent, mf_movie_latent])
# MLP concatenation
mlp = Concatenate(name='mlp_concat')([mlp_user_latent, mlp_movie_latent])
# MLP hidden layers
drop1 = Dropout(0.1, name=f'drop1_{0.1}')(mlp)
hidden1 = Dense(hidden_dim, activation='relu')(drop1)
drop2 = Dropout(dropout, name=f'drop2_{dropout}')(hidden1)
hidden2 = Dense(hidden_dim, activation='relu')(drop2)
drop3 = Dropout(dropout, name=f'drop3_{dropout}')(hidden2)
hidden3 = Dense(hidden_dim, activation='relu')(drop3)
drop4 = Dropout(dropout, name=f'drop4_{dropout}')(hidden3)
# GMF & MLP concatenation
ncf_concat = Concatenate(name='ncf_concat')([gmf, drop4])
x = Dense(1, activation='sigmoid', kernel_initializer='he_normal', name='prediction')(ncf_concat)
# Since the previous layer outputs a sigmoid in [0, 1], we reverse-normalize to our rating range [min_rating, max_rating]
pred = Lambda(lambda value: reverse_min_max_normalize(value, min_rating, max_rating), name='min_max_normalize')(x)
model = Model(inputs=[user_input, movie_input], outputs=pred) # , gender_input, age_input
model.compile(loss=loss_f, optimizer=Adam(lr=lr), metrics=['mae', root_mean_squared_error])
return model
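The `reverse_min_max_normalize` helper is defined earlier in the notebook; a minimal sketch of the assumed behavior (plain min-max scaling applied in reverse) is:

```python
def reverse_min_max_normalize(value, min_rating, max_rating):
    # Map a sigmoid output in [0, 1] back onto the original rating scale.
    return value * (max_rating - min_rating) + min_rating

# A sigmoid output of 0.75 on a 1-5 star scale maps back to 4.0
print(reverse_min_max_normalize(0.75, 1, 5))  # -> 4.0
```

Inside the model the same arithmetic is applied element-wise to the tensor coming out of the sigmoid `Dense` layer.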
ncf_model_3 = create_ncf_model_2(num_users=rating_user_count, num_movies=rated_movie_count, latent_dim=20, hidden_dim=20, loss_f='mae', lr=0.0001, dropout=0.5)
ncf_model_3.summary()
Model: "model_31"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
user_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
movie_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
mf_user_embedding (Embedding) (None, 1, 20) 18860 user_input[0][0]
user_input[0][0]
__________________________________________________________________________________________________
mf_movie_embedding (Embedding) (None, 1, 20) 33640 movie_input[0][0]
movie_input[0][0]
__________________________________________________________________________________________________
mlp_user_latent (Flatten) (None, 20) 0 mf_user_embedding[1][0]
__________________________________________________________________________________________________
mlp_movie_latent (Flatten) (None, 20) 0 mf_movie_embedding[1][0]
__________________________________________________________________________________________________
mlp_concat (Concatenate) (None, 40) 0 mlp_user_latent[0][0]
mlp_movie_latent[0][0]
__________________________________________________________________________________________________
drop1_0.1 (Dropout) (None, 40) 0 mlp_concat[0][0]
__________________________________________________________________________________________________
dense_39 (Dense) (None, 20) 820 drop1_0.1[0][0]
__________________________________________________________________________________________________
drop2_0.5 (Dropout) (None, 20) 0 dense_39[0][0]
__________________________________________________________________________________________________
dense_40 (Dense) (None, 20) 420 drop2_0.5[0][0]
__________________________________________________________________________________________________
drop3_0.5 (Dropout) (None, 20) 0 dense_40[0][0]
__________________________________________________________________________________________________
mf_user_latent (Flatten) (None, 20) 0 mf_user_embedding[0][0]
__________________________________________________________________________________________________
mf_movie_latent (Flatten) (None, 20) 0 mf_movie_embedding[0][0]
__________________________________________________________________________________________________
dense_41 (Dense) (None, 20) 420 drop3_0.5[0][0]
__________________________________________________________________________________________________
gmf_user_movie_element_wise_pro (None, 20) 0 mf_user_latent[0][0]
mf_movie_latent[0][0]
__________________________________________________________________________________________________
drop4_0.5 (Dropout) (None, 20) 0 dense_41[0][0]
__________________________________________________________________________________________________
ncf_concat (Concatenate) (None, 40) 0 gmf_user_movie_element_wise_produ
drop4_0.5[0][0]
__________________________________________________________________________________________________
prediction (Dense) (None, 1) 41 ncf_concat[0][0]
__________________________________________________________________________________________________
min_max_normalize (Lambda) (None, 1) 0 prediction[0][0]
==================================================================================================
Total params: 54,201
Trainable params: 54,201
Non-trainable params: 0
__________________________________________________________________________________________________
plot_model(ncf_model_3, "ncf_model_3.png", show_shapes=True)
train_time_callback_3, callbacks = create_callbacks('ncf_model_3')
ncf_model_3_history = ncf_model_3.fit([X_train_map['user_id'], X_train_map['movie_id']], y_train, epochs=25, validation_split=.1, verbose=1, callbacks=callbacks, batch_size=32)
Epoch 1/25 2250/2250 [==============================] - 7s 3ms/step - loss: 0.9765 - mae: 0.9765 - root_mean_squared_error: 1.1873 - val_loss: 0.8875 - val_mae: 0.8874 - val_root_mean_squared_error: 1.0408
Epoch 2/25 2250/2250 [==============================] - 6s 3ms/step - loss: 0.9191 - mae: 0.9190 - root_mean_squared_error: 1.1189 - val_loss: 0.8428 - val_mae: 0.8427 - val_root_mean_squared_error: 1.0060
Epoch 3/25 2250/2250 [==============================] - 6s 3ms/step - loss: 0.8691 - mae: 0.8690 - root_mean_squared_error: 1.0699 - val_loss: 0.8017 - val_mae: 0.8017 - val_root_mean_squared_error: 0.9823
Epoch 4/25 2250/2250 [==============================] - 6s 3ms/step - loss: 0.8183 - mae: 0.8182 - root_mean_squared_error: 1.0320 - val_loss: 0.7918 - val_mae: 0.7918 - val_root_mean_squared_error: 0.9855
Epoch 5/25 2250/2250 [==============================] - 6s 2ms/step - loss: 0.7983 - mae: 0.7982 - root_mean_squared_error: 1.0209 - val_loss: 0.7922 - val_mae: 0.7921 - val_root_mean_squared_error: 0.9903
Epoch 6/25 2250/2250 [==============================] - 5s 2ms/step - loss: 0.7912 - mae: 0.7911 - root_mean_squared_error: 1.0214 - val_loss: 0.7939 - val_mae: 0.7938 - val_root_mean_squared_error: 0.9952
Epoch 7/25 2250/2250 [==============================] - 5s 2ms/step - loss: 0.7760 - mae: 0.7760 - root_mean_squared_error: 1.0122 - val_loss: 0.7946 - val_mae: 0.7945 - val_root_mean_squared_error: 0.9985
Epoch 8/25 2250/2250 [==============================] - 5s 2ms/step - loss: 0.7740 - mae: 0.7739 - root_mean_squared_error: 1.0118 - val_loss: 0.7955 - val_mae: 0.7954 - val_root_mean_squared_error: 0.9993
Epoch 9/25 2250/2250 [==============================] - 5s 2ms/step - loss: 0.7705 - mae: 0.7704 - root_mean_squared_error: 1.0079 - val_loss: 0.7962 - val_mae: 0.7962 - val_root_mean_squared_error: 1.0019
Epoch 10/25 2250/2250 [==============================] - 5s 2ms/step - loss: 0.7675 - mae: 0.7675 - root_mean_squared_error: 1.0075 - val_loss: 0.7995 - val_mae: 0.7995 - val_root_mean_squared_error: 1.0057
Epoch 11/25 2250/2250 [==============================] - 6s 2ms/step - loss: 0.7594 - mae: 0.7593 - root_mean_squared_error: 1.0027 - val_loss: 0.7999 - val_mae: 0.7998 - val_root_mean_squared_error: 1.0061
Epoch 12/25 2250/2250 [==============================] - 5s 2ms/step - loss: 0.7606 - mae: 0.7605 - root_mean_squared_error: 1.0045 - val_loss: 0.7995 - val_mae: 0.7994 - val_root_mean_squared_error: 1.0076
Epoch 13/25 2250/2250 [==============================] - 5s 2ms/step - loss: 0.7665 - mae: 0.7664 - root_mean_squared_error: 1.0077 - val_loss: 0.8020 - val_mae: 0.8019 - val_root_mean_squared_error: 1.0098
Epoch 14/25 2250/2250 [==============================] - 6s 2ms/step - loss: 0.7607 - mae: 0.7606 - root_mean_squared_error: 1.0047 - val_loss: 0.7990 - val_mae: 0.7989 - val_root_mean_squared_error: 1.0072
Total training time: 78.89499187469482 sec
plot_history(ncf_model_3_history, plot_title='NCF V3 model')
loss_metrics_ncf_3 = ncf_model_3.evaluate([X_test_map['user_id'], X_test_map['movie_id']], y_test)
print(f'MAE: {loss_metrics_ncf_3[1]}')
print(f'RMSE: {loss_metrics_ncf_3[2]}')
625/625 [==============================] - 1s 1ms/step - loss: 0.7855 - mae: 0.7854 - root_mean_squared_error: 1.0198
MAE: 0.7854375243186951
RMSE: 1.0197725296020508
prediction = ncf_model_3.predict([X_test_map['user_id'], X_test_map['movie_id']], batch_size=20)
test_pred = ratings_test.assign(rating_pred=prediction)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 4.339095 |
| 1 | 0 | 9 | 3 | 875693118 | 4.568455 |
| 2 | 0 | 11 | 5 | 878542960 | 4.988988 |
| 3 | 0 | 13 | 5 | 874965706 | 4.396618 |
| 4 | 0 | 16 | 3 | 875073198 | 3.649216 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 4.133552 |
| 19996 | 457 | 1100 | 4 | 886397931 | 4.113035 |
| 19997 | 458 | 933 | 3 | 879563639 | 2.999519 |
| 19998 | 459 | 9 | 3 | 882912371 | 3.334958 |
| 19999 | 461 | 681 | 5 | 886365231 | 4.054100 |
20000 rows × 5 columns
mprk_ncf_3 = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='NCF Model V3')
We will now try to make the model 'wider' by increasing the number of neurons from 20 to 64 in each of the three hidden layers on the MLP side of the model (the embedding latent dimension is increased from 20 to 64 as well).
ncf_model_4 = create_ncf_model_2(num_users=rating_user_count, num_movies=rated_movie_count, latent_dim=64, hidden_dim=64, loss_f='mae', lr=0.0001, dropout=0.5)
ncf_model_4.summary()
Model: "model_32"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
user_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
movie_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
mf_user_embedding (Embedding) (None, 1, 64) 60352 user_input[0][0]
user_input[0][0]
__________________________________________________________________________________________________
mf_movie_embedding (Embedding) (None, 1, 64) 107648 movie_input[0][0]
movie_input[0][0]
__________________________________________________________________________________________________
mlp_user_latent (Flatten) (None, 64) 0 mf_user_embedding[1][0]
__________________________________________________________________________________________________
mlp_movie_latent (Flatten) (None, 64) 0 mf_movie_embedding[1][0]
__________________________________________________________________________________________________
mlp_concat (Concatenate) (None, 128) 0 mlp_user_latent[0][0]
mlp_movie_latent[0][0]
__________________________________________________________________________________________________
drop1_0.1 (Dropout) (None, 128) 0 mlp_concat[0][0]
__________________________________________________________________________________________________
dense_42 (Dense) (None, 64) 8256 drop1_0.1[0][0]
__________________________________________________________________________________________________
drop2_0.5 (Dropout) (None, 64) 0 dense_42[0][0]
__________________________________________________________________________________________________
dense_43 (Dense) (None, 64) 4160 drop2_0.5[0][0]
__________________________________________________________________________________________________
drop3_0.5 (Dropout) (None, 64) 0 dense_43[0][0]
__________________________________________________________________________________________________
mf_user_latent (Flatten) (None, 64) 0 mf_user_embedding[0][0]
__________________________________________________________________________________________________
mf_movie_latent (Flatten) (None, 64) 0 mf_movie_embedding[0][0]
__________________________________________________________________________________________________
dense_44 (Dense) (None, 64) 4160 drop3_0.5[0][0]
__________________________________________________________________________________________________
gmf_user_movie_element_wise_pro (None, 64) 0 mf_user_latent[0][0]
mf_movie_latent[0][0]
__________________________________________________________________________________________________
drop4_0.5 (Dropout) (None, 64) 0 dense_44[0][0]
__________________________________________________________________________________________________
ncf_concat (Concatenate) (None, 128) 0 gmf_user_movie_element_wise_produ
drop4_0.5[0][0]
__________________________________________________________________________________________________
prediction (Dense) (None, 1) 129 ncf_concat[0][0]
__________________________________________________________________________________________________
min_max_normalize (Lambda) (None, 1) 0 prediction[0][0]
==================================================================================================
Total params: 184,705
Trainable params: 184,705
Non-trainable params: 0
__________________________________________________________________________________________________
plot_model(ncf_model_4, "ncf_model_4.png", show_shapes=True)
train_time_callback_4, callbacks = create_callbacks('ncf_model_4')
ncf_model_4_history = ncf_model_4.fit([X_train_map['user_id'], X_train_map['movie_id']], y_train, epochs=25, validation_split=.1, verbose=1, callbacks=callbacks, batch_size=32)
Epoch 1/25 2250/2250 [==============================] - 11s 4ms/step - loss: 0.9688 - mae: 0.9685 - root_mean_squared_error: 1.1776 - val_loss: 0.8410 - val_mae: 0.8407 - val_root_mean_squared_error: 1.0099
Epoch 2/25 2250/2250 [==============================] - 10s 4ms/step - loss: 0.8315 - mae: 0.8313 - root_mean_squared_error: 1.0392 - val_loss: 0.7938 - val_mae: 0.7936 - val_root_mean_squared_error: 0.9850
Epoch 3/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7728 - mae: 0.7725 - root_mean_squared_error: 0.9890 - val_loss: 0.7899 - val_mae: 0.7897 - val_root_mean_squared_error: 0.9836
Epoch 4/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7590 - mae: 0.7588 - root_mean_squared_error: 0.9785 - val_loss: 0.7910 - val_mae: 0.7907 - val_root_mean_squared_error: 0.9836
Epoch 5/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7522 - mae: 0.7519 - root_mean_squared_error: 0.9756 - val_loss: 0.7895 - val_mae: 0.7892 - val_root_mean_squared_error: 0.9837
Epoch 6/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7394 - mae: 0.7392 - root_mean_squared_error: 0.9612 - val_loss: 0.7875 - val_mae: 0.7873 - val_root_mean_squared_error: 0.9816
Epoch 7/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7346 - mae: 0.7343 - root_mean_squared_error: 0.9526 - val_loss: 0.7865 - val_mae: 0.7863 - val_root_mean_squared_error: 0.9772
Epoch 8/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7222 - mae: 0.7220 - root_mean_squared_error: 0.9427 - val_loss: 0.7863 - val_mae: 0.7861 - val_root_mean_squared_error: 0.9796
Epoch 9/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7207 - mae: 0.7205 - root_mean_squared_error: 0.9448 - val_loss: 0.7867 - val_mae: 0.7865 - val_root_mean_squared_error: 0.9801
Epoch 10/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7139 - mae: 0.7137 - root_mean_squared_error: 0.9365 - val_loss: 0.7861 - val_mae: 0.7859 - val_root_mean_squared_error: 0.9792
Epoch 11/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7061 - mae: 0.7058 - root_mean_squared_error: 0.9319 - val_loss: 0.7862 - val_mae: 0.7859 - val_root_mean_squared_error: 0.9770
Epoch 12/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7079 - mae: 0.7076 - root_mean_squared_error: 0.9345 - val_loss: 0.7838 - val_mae: 0.7835 - val_root_mean_squared_error: 0.9758
Epoch 13/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6999 - mae: 0.6996 - root_mean_squared_error: 0.9265 - val_loss: 0.7854 - val_mae: 0.7851 - val_root_mean_squared_error: 0.9780
Epoch 14/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6944 - mae: 0.6941 - root_mean_squared_error: 0.9211 - val_loss: 0.7838 - val_mae: 0.7835 - val_root_mean_squared_error: 0.9787
Epoch 15/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6970 - mae: 0.6967 - root_mean_squared_error: 0.9296 - val_loss: 0.7852 - val_mae: 0.7849 - val_root_mean_squared_error: 0.9787
Epoch 16/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6897 - mae: 0.6894 - root_mean_squared_error: 0.9201 - val_loss: 0.7854 - val_mae: 0.7851 - val_root_mean_squared_error: 0.9801
Epoch 17/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6849 - mae: 0.6846 - root_mean_squared_error: 0.9187 - val_loss: 0.7851 - val_mae: 0.7848 - val_root_mean_squared_error: 0.9810
Epoch 18/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6828 - mae: 0.6825 - root_mean_squared_error: 0.9143 - val_loss: 0.7867 - val_mae: 0.7864 - val_root_mean_squared_error: 0.9848
Epoch 19/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6797 - mae: 0.6794 - root_mean_squared_error: 0.9143 - val_loss: 0.7880 - val_mae: 0.7877 - val_root_mean_squared_error: 0.9859
Epoch 20/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6746 - mae: 0.6743 - root_mean_squared_error: 0.9091 - val_loss: 0.7876 - val_mae: 0.7873 - val_root_mean_squared_error: 0.9854
Epoch 21/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6732 - mae: 0.6729 - root_mean_squared_error: 0.9091 - val_loss: 0.7895 - val_mae: 0.7892 - val_root_mean_squared_error: 0.9888
Epoch 22/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6730 - mae: 0.6727 - root_mean_squared_error: 0.9120 - val_loss: 0.7890 - val_mae: 0.7887 - val_root_mean_squared_error: 0.9889
Epoch 23/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6698 - mae: 0.6695 - root_mean_squared_error: 0.9098 - val_loss: 0.7914 - val_mae: 0.7911 - val_root_mean_squared_error: 0.9916
Epoch 24/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6646 - mae: 0.6642 - root_mean_squared_error: 0.9033 - val_loss: 0.7905 - val_mae: 0.7901 - val_root_mean_squared_error: 0.9899
Total training time: 223.5380778312683 sec
plot_history(ncf_model_4_history, plot_title='NCF V4 model')
loss_metrics_ncf_4 = ncf_model_4.evaluate([X_test_map['user_id'], X_test_map['movie_id']], y_test)
print(f'MAE: {loss_metrics_ncf_4[1]}')
print(f'RMSE: {loss_metrics_ncf_4[2]}')
625/625 [==============================] - 1s 2ms/step - loss: 0.7260 - mae: 0.7257 - root_mean_squared_error: 0.9612
MAE: 0.7257158160209656
RMSE: 0.9612079858779907
prediction = ncf_model_4.predict([X_test_map['user_id'], X_test_map['movie_id']], batch_size=20)
test_pred = ratings_test.assign(rating_pred=prediction)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 4.493304 |
| 1 | 0 | 9 | 3 | 875693118 | 4.193358 |
| 2 | 0 | 11 | 5 | 878542960 | 4.958766 |
| 3 | 0 | 13 | 5 | 874965706 | 4.445634 |
| 4 | 0 | 16 | 3 | 875073198 | 3.100778 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 3.963994 |
| 19996 | 457 | 1100 | 4 | 886397931 | 3.848687 |
| 19997 | 458 | 933 | 3 | 879563639 | 2.999002 |
| 19998 | 459 | 9 | 3 | 882912371 | 3.549572 |
| 19999 | 461 | 681 | 5 | 886365231 | 4.190098 |
20000 rows × 5 columns
mprk_ncf_4 = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='NCF Model V4')
plot_model_loss_comparison(losses={'Baseline All': {'mae': basemodel_loss["MAE"], 'rmse': basemodel_loss["RMSE"]},
'TuriCreate Matrix Factorization': {'mae': mae(test_data["rating"], y_pred_mf), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_mf))},
'NCF Model V1 ((1, 20) hidden, MSE loss, 0.001 LR)': {'mae': loss_metrics_ncf_1[1], 'rmse': loss_metrics_ncf_1[2]},
'NCF Model V2 ((1, 20) hidden, MAE loss, 0.0001 LR)': {'mae': loss_metrics_ncf_2[1], 'rmse': loss_metrics_ncf_2[2]},
'NCF Model V3 ((3, 20) hidden, MAE loss, 0.0001 LR)': {'mae': loss_metrics_ncf_3[1], 'rmse': loss_metrics_ncf_3[2]},
'NCF Model V4 ((3, 64) hidden, MAE loss, 0.0001 LR)': {'mae': loss_metrics_ncf_4[1], 'rmse': loss_metrics_ncf_4[2]},
},
plot_title='Keras NCF models comparison', high=True)
plot_model_train_time_comparison(times={'NCF Model V1': train_time_callback_1.train_time_end,
'NCF Model V2': train_time_callback_2.train_time_end,
'NCF Model V3': train_time_callback_3.train_time_end,
'NCF Model V4': train_time_callback_4.train_time_end}, plot_title='Keras NCF Training time comparison')
plot_model_precision_recall_at_k_comparison(data={
'Baseline All': mprk_baseline_all,
'TuriCreate Matrix Factorization': mprk_mf,
'NCF Model V1': mprk_ncf_1,
'NCF Model V2': mprk_ncf_2,
'NCF Model V3': mprk_ncf_3,
'NCF Model V4': mprk_ncf_4,
},
plot_title='Keras NCF', wide=True)
DeepFM combines the power of factorization machines for recommendation and deep learning for feature learning in a single neural network architecture.
Compared to Google's Wide & Deep model, DeepFM has a shared input to its "wide" and "deep" parts, with no need for feature engineering besides raw features.
DeepFM consists of an FM component and a deep component, integrated in a parallel structure. The FM component is the same as a 2-way factorization machine and models the low-order feature interactions. The deep component is a multi-layer perceptron that captures high-order feature interactions and nonlinearities. The two components share the same inputs/embeddings, and their outputs are summed to form the final prediction.
The advantage of DeepFM over the Wide & Deep model is that it reduces the effort of hand-crafted feature engineering by identifying feature combinations automatically.
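In the original DeepFM formulation (aimed at CTR prediction) the summed outputs pass through a sigmoid; our regression variant differs only in the output head. The structure, with $w$ the 1st-order weights and $v_i$ the shared embeddings, is:

```latex
\hat{y} = \sigma\!\left(y_{\mathrm{FM}} + y_{\mathrm{DNN}}\right),
\qquad
y_{\mathrm{FM}} = \langle w, x \rangle
  + \sum_{i=1}^{p}\sum_{j=i+1}^{p} \langle v_i, v_j \rangle \, x_i x_j
```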
In order to feed the model with additional features, we first need to preprocess them. From each user we will use the gender and age; from each movie we will use its list of genres.
def text2seq(text, n_genre, vec_size):
""" using tokenizer to encoded the multi-level categorical feature"""
tokenizer = Tokenizer(lower=True, split='|',filters='', num_words=n_genre)
tokenizer.fit_on_texts(text)
seq = tokenizer.texts_to_sequences(text)
seq = pad_sequences(seq, maxlen=vec_size, padding='post')
return seq
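Without pulling in Keras, the encoding `text2seq` performs can be illustrated in plain Python: tokens are indexed by descending frequency (index 0 reserved for padding) and each row is post-padded to a fixed length. Note this is an illustration; Keras' Tokenizer may break frequency ties in a different order.

```python
from collections import Counter

def encode_genres(texts, num_words, vec_size):
    # Count token frequencies across all rows (split on '|', lowercased),
    # mirroring what Keras' Tokenizer does when fitting.
    counts = Counter(tok for t in texts for tok in t.lower().split('|'))
    # Most frequent token gets index 1; index 0 is reserved for padding.
    vocab = {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common())}
    seqs = []
    for t in texts:
        seq = [vocab[tok] for tok in t.lower().split('|') if vocab[tok] < num_words]
        seqs.append((seq + [0] * vec_size)[:vec_size])  # post-pad with zeros
    return seqs

rows = ['Animation|Children|Comedy', 'Comedy', 'Drama|Romance']
print(encode_genres(rows, num_words=20, vec_size=4))
# -> [[2, 3, 1, 0], [1, 0, 0, 0], [4, 5, 0, 0]]
```

'Comedy' appears twice, so it gets index 1; the remaining genres are indexed in order of first appearance.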
Extract the movie genres from the binary indicator columns into a list of genres per movie.
num_generes = len(genre_cols)
condense_movies = movies.apply(lambda row: condense_binary_row(row, 'genres', genre_cols), axis=1)[['movie_id', 'genres']]
max_movie_genres = max(condense_movies['genres'].str.len())
movies_details = pd.merge(pd.Series(X_train_map['movie_id'], name='movie_id'), condense_movies, on='movie_id')
movies_details
| | movie_id | genres |
|---|---|---|
| 0 | 0 | [Animation, Children, Comedy] |
| 1 | 0 | [Animation, Children, Comedy] |
| 2 | 0 | [Animation, Children, Comedy] |
| 3 | 0 | [Animation, Children, Comedy] |
| 4 | 0 | [Animation, Children, Comedy] |
| ... | ... | ... |
| 79995 | 1678 | [Romance, Thriller] |
| 79996 | 1679 | [Drama, Romance] |
| 79997 | 906 | [Comedy] |
| 79998 | 1680 | [Comedy] |
| 79999 | 1681 | [Drama] |
80000 rows × 2 columns
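The `condense_binary_row` helper used above is defined earlier in the notebook; a minimal sketch of what such a helper could look like, assuming one 0/1 indicator column per genre (the data below is a toy subset):

```python
import pandas as pd

def condense_binary_row(row, target_col, binary_cols):
    # Collect the names of the indicator columns set to 1 into one list column
    row[target_col] = [c for c in binary_cols if row[c] == 1]
    return row

genre_cols = ['Animation', 'Children', 'Comedy']  # toy subset of the 19 genres
movies = pd.DataFrame([
    {'movie_id': 0, 'Animation': 1, 'Children': 1, 'Comedy': 1},
    {'movie_id': 1, 'Animation': 0, 'Children': 0, 'Comedy': 1},
])
condensed = movies.apply(lambda r: condense_binary_row(r, 'genres', genre_cols), axis=1)[['movie_id', 'genres']]
print(condensed['genres'].tolist())  # -> [['Animation', 'Children', 'Comedy'], ['Comedy']]
```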
Convert the genre list of each movie into a fixed-length integer sequence.
condense_movies['genres'] = text2seq(condense_movies['genres'].values, n_genre=num_generes, vec_size=max_movie_genres).tolist()
condense_movies
| | movie_id | genres |
|---|---|---|
| 0 | 0 | [15, 7, 2, 0, 0, 0] |
| 1 | 1 | [3, 6, 4, 0, 0, 0] |
| 2 | 2 | [4, 0, 0, 0, 0, 0] |
| 3 | 3 | [3, 2, 1, 0, 0, 0] |
| 4 | 4 | [8, 1, 4, 0, 0, 0] |
| ... | ... | ... |
| 1677 | 1677 | [1, 0, 0, 0, 0, 0] |
| 1678 | 1678 | [5, 4, 0, 0, 0, 0] |
| 1679 | 1679 | [1, 5, 0, 0, 0, 0] |
| 1680 | 1680 | [2, 0, 0, 0, 0, 0] |
| 1681 | 1681 | [1, 0, 0, 0, 0, 0] |
1682 rows × 2 columns
Add the movie genres and the users' gender and age features to our training and test data (converting gender 'M' and 'F' to 1 and 0).
movie_details = pd.merge(pd.Series(X_train_map['movie_id'], name='movie_id'), condense_movies, on='movie_id')
X_train_map['genres'] = np.array([np.array(x) for x in movie_details['genres'].values])
movie_details = pd.merge(pd.Series(X_test_map['movie_id'], name='movie_id'), condense_movies, on='movie_id')
X_test_map['genres'] = np.array([np.array(x) for x in movie_details['genres'].values])
users_details = pd.merge(pd.Series(X_train_map['user_id'], name='user_id'), users, on='user_id')
X_train_map['gender'] = users_details['sex'].apply(lambda g: 1 if g == 'M' else 0).values.astype(np.int32)
X_train_map['age'] = users_details['age'].values.astype(np.int32)
users_details = pd.merge(pd.Series(X_test_map['user_id'], name='user_id'), users, on='user_id')
X_test_map['gender'] = users_details['sex'].apply(lambda g: 1 if g == 'M' else 0).values.astype(np.int32)
X_test_map['age'] = users_details['age'].values.astype(np.int32)
max_age = users['age'].max() + 1
print(f'Train data shapes: user_id: {X_train_map["user_id"].shape}, movie_id: {X_train_map["movie_id"].shape}, gender: {X_train_map["gender"].shape}, age: {X_train_map["age"].shape}, genres: {X_train_map["genres"].shape}')
print(f'Test data shapes: user_id: {X_test_map["user_id"].shape}, movie_id: {X_test_map["movie_id"].shape}, gender: {X_test_map["gender"].shape}, age: {X_test_map["age"].shape}, genres: {X_test_map["genres"].shape}')
Train data shapes: user_id: (80000,), movie_id: (80000,), gender: (80000,), age: (80000,), genres: (80000, 6)
Test data shapes: user_id: (20000,), movie_id: (20000,), gender: (20000,), age: (20000,), genres: (20000, 6)
For the 1st-order FM part:
- numeric features with shape (None, 1) => dense layer
- single-level categorical features with shape (None, 1) => embedding layer (latent_dim = 1)
- multi-level categorical features with shape (None, L) => embedding layer (latent_dim = 1)
- the output is the summation of all embedded features; the 1st order requires each feature to map to a scalar

For the 2nd-order FM part:
- numeric features => dense layer
- single-level categorical features => embedding layer (latent_dim = k)
- multi-level categorical features with shape (None, L) => embedding layer (latent_dim = k)
- the shared embedding layer is the concatenation of all embedded features
- shared embedding layer => dot layer => 2nd order of the FM part
def Tensor_Mean_Pooling(name='mean_pooling', keepdims=False):
return Lambda(lambda x: K.mean(x, axis=1, keepdims=keepdims), name=name)
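The mean pooling collapses the `(batch, L, latent_dim)` genre-embedding tensor to `(batch, latent_dim)` by averaging over the L genre slots. In plain NumPy the operation is equivalent to:

```python
import numpy as np

# Toy batch: 2 movies, 3 genre slots each, embedded into 4 latent dims
embedded = np.arange(24, dtype=float).reshape(2, 3, 4)
pooled = embedded.mean(axis=1)  # collapse the genre axis -> shape (2, 4)
print(pooled[0])  # -> [4. 5. 6. 7.]
```

One caveat worth noting: a plain `Lambda` layer does not consume the mask produced by `mask_zero=True`, so zero-padded genre slots are still included in the average.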
def fm_add(inputs, num_users, num_movies, latent_dim, num_genres, max_age):
    user_input, movie_input, genre_input, gender_input, age_input = inputs
    fm_1d_embedding_user = Embedding(input_dim=num_users, output_dim=latent_dim, name='fm_1d_user_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))
    fm_1d_user_latent = Flatten(name='fm_1d_user_latent')(fm_1d_embedding_user(user_input))
    fm_1d_embedding_movie = Embedding(input_dim=num_movies, output_dim=latent_dim, name='fm_1d_movie_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))
    fm_1d_movie_latent = Flatten(name='fm_1d_movie_latent')(fm_1d_embedding_movie(movie_input))
    cat_genre_embed_1d = Embedding(input_dim=num_genres + 1, output_dim=latent_dim, mask_zero=True, name='cat_embed_1d_genre', input_length=max_movie_genres, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))(genre_input)
    fm_1d_genre_mean = Tensor_Mean_Pooling(name='fm_1d_genre_mean')(cat_genre_embed_1d)
    fm_1d_embedding_gender = Embedding(input_dim=2, output_dim=latent_dim, name='fm_1d_gender_embedding', mask_zero=False, input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))
    fm_1d_gender_latent = Flatten(name='fm_1d_gender_latent')(fm_1d_embedding_gender(gender_input))
    fm_1d_embedding_age = Embedding(input_dim=max_age, output_dim=latent_dim, name='fm_1d_age_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))
    fm_1d_age_latent = Flatten(name='fm_1d_age_latent')(fm_1d_embedding_age(age_input))
    # Add tensors to produce a scalar value
    y_fm_1d = Add(name='fm_1d_output')([fm_1d_user_latent, fm_1d_movie_latent, fm_1d_gender_latent, fm_1d_age_latent, fm_1d_genre_mean])
    return y_fm_1d
Here k is the matrix-factorization latent dimension and p is the number of features. The calculation of the interaction terms can be simplified using:
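For reference, the standard FM prediction and the second-order simplification (restated here, consistent with the code below) are:

$$\hat{y}_{\mathrm{FM}}(x) = w_0 + \sum_{i=1}^{p} w_i x_i + \sum_{i=1}^{p} \sum_{j=i+1}^{p} \langle v_i, v_j \rangle\, x_i x_j$$

$$\sum_{i=1}^{p} \sum_{j=i+1}^{p} \langle v_i, v_j \rangle\, x_i x_j = \frac{1}{2} \sum_{f=1}^{k} \left[ \left( \sum_{i=1}^{p} v_{i,f}\, x_i \right)^{2} - \sum_{i=1}^{p} v_{i,f}^{2}\, x_i^{2} \right]$$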
def fm_interactions(inputs, num_users, num_movies, latent_dim, num_genres, max_age):
    user_input, movie_input, genre_input, gender_input, age_input = inputs
    fm_2d_embedding_user = Embedding(input_dim=num_users, output_dim=latent_dim, name='fm_2d_user_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))(user_input)
    fm_2d_embedding_movie = Embedding(input_dim=num_movies, output_dim=latent_dim, name='fm_2d_movie_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))(movie_input)
    cat_genre_embed_2d = Embedding(input_dim=num_genres + 1, output_dim=latent_dim, mask_zero=True, name='cat_embed_2d_genre', input_length=max_movie_genres, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))(genre_input)
    fm_2d_genre_mean = Tensor_Mean_Pooling(name='fm_2d_genre_mean', keepdims=True)(cat_genre_embed_2d)
    fm_2d_embedding_gender = Embedding(input_dim=2, output_dim=latent_dim, name='fm_2d_gender_embedding', mask_zero=False, input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))(gender_input)
    fm_2d_embedding_age = Embedding(input_dim=max_age, output_dim=latent_dim, name='fm_2d_age_embedding', input_length=1, embeddings_initializer='he_normal', embeddings_regularizer=l2(1e-6))(age_input)
    # Concatenate all 2d embedding layers => (None, 5, latent_dim)
    embed_2d = Concatenate(axis=1, name='concat_embed_2d')([fm_2d_embedding_user, fm_2d_embedding_movie, fm_2d_genre_mean, fm_2d_embedding_gender, fm_2d_embedding_age])
    # Calculate the pairwise interactions using the simplification:
    # sum_{i<j} x_i * x_j = 0.5 * [(sum_i x_i)^2 - sum_i x_i^2]
    sum_of_embed = Lambda(lambda x: K.sum(x, axis=1), name='embed_2d_sum')(embed_2d)
    square_of_sum = Multiply(name='square_of_sum')([sum_of_embed, sum_of_embed])
    square_of_embed = Multiply(name='square_of_embed')([embed_2d, embed_2d])
    sum_of_square = Lambda(lambda x: K.sum(x, axis=1), name='embed_2d_square_embed_sum')(square_of_embed)
    sub = Subtract(name='sub_square_of_sum_sum_of_square')([square_of_sum, sum_of_square])
    half = Lambda(lambda x: x * 0.5, name='divide_by_2')(sub)
    sum_of_half = Lambda(lambda x: K.sum(x, axis=1), name='embed_2d_half_sum')(half)
    y_fm_2d = Flatten(name='fm_2d_output')(sum_of_half)
    return y_fm_2d, embed_2d
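As a sanity check, the square-of-sum minus sum-of-squares identity that `fm_interactions` implements can be verified with plain NumPy on random embeddings (a standalone sketch, independent of the model):

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.normal(size=(5, 20))  # 5 embedded features, latent_dim = 20

# Brute-force sum of pairwise dot products <v_i, v_j> for i < j
brute = sum(V[i] @ V[j] for i in range(5) for j in range(i + 1, 5))

# Simplified form used in fm_interactions:
# 0.5 * sum_f [(sum_i v_if)^2 - sum_i v_if^2]
square_of_sum = V.sum(axis=0) ** 2
sum_of_square = (V ** 2).sum(axis=0)
simplified = 0.5 * (square_of_sum - sum_of_square).sum()

assert np.isclose(brute, simplified)
```

The simplification turns an O(p²k) pairwise computation into O(pk), which is why the Keras graph only needs sums, squares, and a subtraction.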
def dnn(embed_2d, dnn_dims, dropout):
    # Flatten the embedding tensor from 3D to 2D
    y_dnn = Flatten(name='flat_embed_2d')(embed_2d)
    for hidden_dim in dnn_dims:
        y_dnn = Dropout(dropout)(y_dnn)
        y_dnn = Dense(hidden_dim, activation='relu')(y_dnn)
    y_dnn = Dropout(dropout)(y_dnn)
    y_dnn = Dense(1, activation='relu')(y_dnn)
    y_dnn = Flatten(name='deep_output')(y_dnn)
    return y_dnn
def deepfm_model(num_users, num_movies, num_genres, max_age, latent_dim, dnn_dims, dropout):
    user_input = Input(shape=(1,), dtype='int32', name='user_input')
    movie_input = Input(shape=(1,), dtype='int32', name='movie_input')
    genre_input = Input(shape=(max_movie_genres,), name='genre_input')
    gender_input = Input(shape=(1,), dtype='int32', name='gender_input')
    age_input = Input(shape=(1,), dtype='int32', name='age_input')
    inputs = [user_input, movie_input, genre_input, gender_input, age_input]
    y_fm_1d = fm_add(inputs, num_users=num_users, num_movies=num_movies, latent_dim=1, num_genres=num_genres, max_age=max_age)
    y_fm_2d, embed_2d = fm_interactions(inputs, num_users=num_users, num_movies=num_movies, latent_dim=latent_dim, num_genres=num_genres, max_age=max_age)
    y_dnn = dnn(embed_2d, dnn_dims, dropout)
    # Combine the deep and FM parts
    y = Concatenate()([y_fm_1d, y_fm_2d, y_dnn])
    y = Dense(1, activation='sigmoid', name='deepfm_output')(y)
    pred = Lambda(lambda value: reverse_min_max_normalize(value, min_rating, max_rating), name='min_max_normalize')(y)
    fm_model_1d = Model(inputs, y_fm_1d)
    fm_model_2d = Model(inputs, y_fm_2d)
    deep_model = Model(inputs, y_dnn)
    deepfm_model = Model(inputs, pred)
    return fm_model_1d, fm_model_2d, deep_model, deepfm_model
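`reverse_min_max_normalize` is defined earlier in the notebook; presumably it maps the sigmoid output in [0, 1] back onto the [min_rating, max_rating] scale, along the lines of this hypothetical sketch:

```python
# Hypothetical sketch; the notebook's actual helper is defined earlier.
def reverse_min_max_normalize(value, min_rating, max_rating):
    # Undo min-max normalization: [0, 1] -> [min_rating, max_rating]
    return value * (max_rating - min_rating) + min_rating
```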
fm_model_1d, fm_model_2d, deep_model, deepfm_model = deepfm_model(num_users=rating_user_count, num_movies=rated_movie_count, num_genres=num_generes, max_age=max_age, latent_dim=20, dnn_dims=[32, 32, 32], dropout=0.5)
deepfm_model.summary()
Model: "model_49"
__________________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
==================================================================================================
genre_input (InputLayer) [(None, 6)] 0
__________________________________________________________________________________________________
user_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
movie_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
gender_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
age_input (InputLayer) [(None, 1)] 0
__________________________________________________________________________________________________
cat_embed_2d_genre (Embedding) (None, 6, 20) 400 genre_input[0][0]
__________________________________________________________________________________________________
fm_2d_user_embedding (Embedding (None, 1, 20) 18860 user_input[0][0]
__________________________________________________________________________________________________
fm_2d_movie_embedding (Embeddin (None, 1, 20) 33640 movie_input[0][0]
__________________________________________________________________________________________________
fm_2d_genre_mean (Lambda) (None, 1, 20) 0 cat_embed_2d_genre[0][0]
__________________________________________________________________________________________________
fm_2d_gender_embedding (Embeddi (None, 1, 20) 40 gender_input[0][0]
__________________________________________________________________________________________________
fm_2d_age_embedding (Embedding) (None, 1, 20) 1480 age_input[0][0]
__________________________________________________________________________________________________
concat_embed_2d (Concatenate) (None, 5, 20) 0 fm_2d_user_embedding[0][0]
fm_2d_movie_embedding[0][0]
fm_2d_genre_mean[0][0]
fm_2d_gender_embedding[0][0]
fm_2d_age_embedding[0][0]
__________________________________________________________________________________________________
flat_embed_2d (Flatten) (None, 100) 0 concat_embed_2d[0][0]
__________________________________________________________________________________________________
dropout_18 (Dropout) (None, 100) 0 flat_embed_2d[0][0]
__________________________________________________________________________________________________
dense_58 (Dense) (None, 32) 3232 dropout_18[0][0]
__________________________________________________________________________________________________
dropout_19 (Dropout) (None, 32) 0 dense_58[0][0]
__________________________________________________________________________________________________
embed_2d_sum (Lambda) (None, 20) 0 concat_embed_2d[0][0]
__________________________________________________________________________________________________
square_of_embed (Multiply) (None, 5, 20) 0 concat_embed_2d[0][0]
concat_embed_2d[0][0]
__________________________________________________________________________________________________
dense_59 (Dense) (None, 32) 1056 dropout_19[0][0]
__________________________________________________________________________________________________
square_of_sum (Multiply) (None, 20) 0 embed_2d_sum[0][0]
embed_2d_sum[0][0]
__________________________________________________________________________________________________
embed_2d_square_embed_sum (Lamb (None, 20) 0 square_of_embed[0][0]
__________________________________________________________________________________________________
dropout_20 (Dropout) (None, 32) 0 dense_59[0][0]
__________________________________________________________________________________________________
sub_square_of_sum_sum_of_square (None, 20) 0 square_of_sum[0][0]
embed_2d_square_embed_sum[0][0]
__________________________________________________________________________________________________
dense_60 (Dense) (None, 32) 1056 dropout_20[0][0]
__________________________________________________________________________________________________
fm_1d_user_embedding (Embedding (None, 1, 1) 943 user_input[0][0]
__________________________________________________________________________________________________
fm_1d_movie_embedding (Embeddin (None, 1, 1) 1682 movie_input[0][0]
__________________________________________________________________________________________________
fm_1d_gender_embedding (Embeddi (None, 1, 1) 2 gender_input[0][0]
__________________________________________________________________________________________________
fm_1d_age_embedding (Embedding) (None, 1, 1) 74 age_input[0][0]
__________________________________________________________________________________________________
cat_embed_1d_genre (Embedding) (None, 6, 1) 20 genre_input[0][0]
__________________________________________________________________________________________________
divide_by_2 (Lambda) (None, 20) 0 sub_square_of_sum_sum_of_square[0
__________________________________________________________________________________________________
dropout_21 (Dropout) (None, 32) 0 dense_60[0][0]
__________________________________________________________________________________________________
fm_1d_user_latent (Flatten) (None, 1) 0 fm_1d_user_embedding[0][0]
__________________________________________________________________________________________________
fm_1d_movie_latent (Flatten) (None, 1) 0 fm_1d_movie_embedding[0][0]
__________________________________________________________________________________________________
fm_1d_gender_latent (Flatten) (None, 1) 0 fm_1d_gender_embedding[0][0]
__________________________________________________________________________________________________
fm_1d_age_latent (Flatten) (None, 1) 0 fm_1d_age_embedding[0][0]
__________________________________________________________________________________________________
fm_1d_genre_mean (Lambda) (None, 1) 0 cat_embed_1d_genre[0][0]
__________________________________________________________________________________________________
embed_2d_half_sum (Lambda) (None,) 0 divide_by_2[0][0]
__________________________________________________________________________________________________
dense_61 (Dense) (None, 1) 33 dropout_21[0][0]
__________________________________________________________________________________________________
fm_1d_output (Add) (None, 1) 0 fm_1d_user_latent[0][0]
fm_1d_movie_latent[0][0]
fm_1d_gender_latent[0][0]
fm_1d_age_latent[0][0]
fm_1d_genre_mean[0][0]
__________________________________________________________________________________________________
fm_2d_output (Flatten) (None, 1) 0 embed_2d_half_sum[0][0]
__________________________________________________________________________________________________
deep_output (Flatten) (None, 1) 0 dense_61[0][0]
__________________________________________________________________________________________________
concatenate_47 (Concatenate) (None, 3) 0 fm_1d_output[0][0]
fm_2d_output[0][0]
deep_output[0][0]
__________________________________________________________________________________________________
deepfm_output (Dense) (None, 1) 4 concatenate_47[0][0]
__________________________________________________________________________________________________
min_max_normalize (Lambda) (None, 1) 0 deepfm_output[0][0]
==================================================================================================
Total params: 62,522
Trainable params: 62,522
Non-trainable params: 0
__________________________________________________________________________________________________
plot_model(fm_model_1d, "deepfm_1d.png", show_shapes=True)
plot_model(fm_model_2d, "deepfm_2d.png", show_shapes=True)
DNN part plot
plot_model(deep_model, "deepfm_dnn.png", show_shapes=True)
DeepFM model full plot
plot_model(deepfm_model, "deepfm.png", show_shapes=True)
deepfm_model.compile(loss='mae', optimizer=Adam(lr=0.0001), metrics=['mae', root_mean_squared_error])
train_time_callback_1, callbacks = create_callbacks('deepfm_model')
deepfm_model_history = deepfm_model.fit([X_train_map['user_id'], X_train_map['movie_id'], X_train_map['genres'], X_train_map['gender'], X_train_map['age']], y_train, epochs=25, validation_split=.1, verbose=1, callbacks=callbacks, batch_size=32)
Epoch 1/25
2250/2250 [==============================] - 10s 4ms/step - loss: 1.0308 - mae: 1.0306 - root_mean_squared_error: 1.2601 - val_loss: 0.8729 - val_mae: 0.8727 - val_root_mean_squared_error: 1.0791
Epoch 2/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.8578 - mae: 0.8576 - root_mean_squared_error: 1.0818 - val_loss: 0.8476 - val_mae: 0.8474 - val_root_mean_squared_error: 1.0437
Epoch 3/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.7984 - mae: 0.7981 - root_mean_squared_error: 1.0137 - val_loss: 0.8314 - val_mae: 0.8311 - val_root_mean_squared_error: 1.0213
Epoch 4/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.7653 - mae: 0.7651 - root_mean_squared_error: 0.9791 - val_loss: 0.8207 - val_mae: 0.8205 - val_root_mean_squared_error: 1.0069
Epoch 5/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.7438 - mae: 0.7436 - root_mean_squared_error: 0.9536 - val_loss: 0.8155 - val_mae: 0.8153 - val_root_mean_squared_error: 1.0008
Epoch 6/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.7306 - mae: 0.7304 - root_mean_squared_error: 0.9408 - val_loss: 0.8152 - val_mae: 0.8149 - val_root_mean_squared_error: 1.0014
Epoch 7/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.7240 - mae: 0.7237 - root_mean_squared_error: 0.9356 - val_loss: 0.8172 - val_mae: 0.8169 - val_root_mean_squared_error: 1.0044
Epoch 8/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.7113 - mae: 0.7111 - root_mean_squared_error: 0.9217 - val_loss: 0.8174 - val_mae: 0.8171 - val_root_mean_squared_error: 1.0048
Epoch 9/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7096 - mae: 0.7093 - root_mean_squared_error: 0.9220 - val_loss: 0.8149 - val_mae: 0.8146 - val_root_mean_squared_error: 1.0015
Epoch 10/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7049 - mae: 0.7047 - root_mean_squared_error: 0.9177 - val_loss: 0.8170 - val_mae: 0.8167 - val_root_mean_squared_error: 1.0054
Epoch 11/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7073 - mae: 0.7071 - root_mean_squared_error: 0.9230 - val_loss: 0.8154 - val_mae: 0.8151 - val_root_mean_squared_error: 1.0037
Epoch 12/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.7053 - mae: 0.7050 - root_mean_squared_error: 0.9192 - val_loss: 0.8180 - val_mae: 0.8177 - val_root_mean_squared_error: 1.0076
Epoch 13/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6972 - mae: 0.6969 - root_mean_squared_error: 0.9139 - val_loss: 0.8172 - val_mae: 0.8169 - val_root_mean_squared_error: 1.0073
Epoch 14/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.6907 - mae: 0.6904 - root_mean_squared_error: 0.9092 - val_loss: 0.8178 - val_mae: 0.8175 - val_root_mean_squared_error: 1.0082
Epoch 15/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.6950 - mae: 0.6947 - root_mean_squared_error: 0.9130 - val_loss: 0.8184 - val_mae: 0.8181 - val_root_mean_squared_error: 1.0093
Epoch 16/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.6919 - mae: 0.6916 - root_mean_squared_error: 0.9106 - val_loss: 0.8185 - val_mae: 0.8182 - val_root_mean_squared_error: 1.0095
Epoch 17/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6930 - mae: 0.6926 - root_mean_squared_error: 0.9122 - val_loss: 0.8190 - val_mae: 0.8187 - val_root_mean_squared_error: 1.0111
Epoch 18/25
2250/2250 [==============================] - 8s 3ms/step - loss: 0.6889 - mae: 0.6885 - root_mean_squared_error: 0.9097 - val_loss: 0.8196 - val_mae: 0.8193 - val_root_mean_squared_error: 1.0127
Epoch 19/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6914 - mae: 0.6911 - root_mean_squared_error: 0.9140 - val_loss: 0.8192 - val_mae: 0.8189 - val_root_mean_squared_error: 1.0117
Total training time: 151.99857664108276 sec
plot_history(deepfm_model_history, plot_title='Keras DeepFM (Gender, Age, Genres)')
loss_metrics = deepfm_model.evaluate([X_test_map['user_id'], X_test_map['movie_id'], X_test_map['genres'], X_test_map['gender'], X_test_map['age']], y_test)
_, deepfm_model_mae, deepfm_model_rmse = loss_metrics
print(f'MAE: {deepfm_model_mae}')
print(f'RMSE: {deepfm_model_rmse}')
625/625 [==============================] - 1s 2ms/step - loss: 0.7534 - mae: 0.7530 - root_mean_squared_error: 0.9631
MAE: 0.7530499696731567
RMSE: 0.9630998969078064
prediction = deepfm_model.predict([X_test_map['user_id'], X_test_map['movie_id'], X_test_map['genres'], X_test_map['gender'], X_test_map['age']], batch_size=20)
test_pred = ratings_test.assign(rating_pred=prediction)
test_pred
|  | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 3.605712 |
| 1 | 0 | 9 | 3 | 875693118 | 4.173556 |
| 2 | 0 | 11 | 5 | 878542960 | 4.822255 |
| 3 | 0 | 13 | 5 | 874965706 | 4.176426 |
| 4 | 0 | 16 | 3 | 875073198 | 3.423254 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 3.942675 |
| 19996 | 457 | 1100 | 4 | 886397931 | 3.942984 |
| 19997 | 458 | 933 | 3 | 879563639 | 2.638029 |
| 19998 | 459 | 9 | 3 | 882912371 | 3.205859 |
| 19999 | 461 | 681 | 5 | 886365231 | 3.969927 |
20000 rows × 5 columns
mprk_deepfm_keras = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='DeepFM Model Keras')
def create_deepfm_ctr_model(fixlen_feature_columns, varlen_feature_columns):
    linear_feature_columns = fixlen_feature_columns + varlen_feature_columns
    dnn_feature_columns = fixlen_feature_columns + varlen_feature_columns
    feature_names = get_feature_names(linear_feature_columns + dnn_feature_columns)
    model = DeepFM(linear_feature_columns, dnn_feature_columns, task='regression')
    model.compile(loss='mae', optimizer=Adam(lr=0.0001), metrics=['mae', root_mean_squared_error])
    return model, feature_names
fixlen_feature_columns = [SparseFeat("user_id", rating_user_count, embedding_dim=20),
SparseFeat("movie_id", rated_movie_count, embedding_dim=20),
SparseFeat("gender", 2, embedding_dim=20),
SparseFeat("age", max_age, embedding_dim=20)]
varlen_feature_columns = [VarLenSparseFeat(SparseFeat('genres', vocabulary_size=num_generes, embedding_dim=20), maxlen=max_movie_genres, combiner='mean', weight_name=None)]  # Note: value 0 is reserved for padding the sequence input feature
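The maxlen/padding convention the comment refers to (id 0 reserved for padding) can be illustrated with a small hypothetical helper, not part of the notebook's pipeline:

```python
import numpy as np

def pad_genre_ids(genre_lists, maxlen):
    # Right-pad variable-length genre id lists with 0 (the reserved padding id)
    out = np.zeros((len(genre_lists), maxlen), dtype=np.int64)
    for row, ids in enumerate(genre_lists):
        out[row, :len(ids)] = ids[:maxlen]
    return out

# Two movies: one with three genres, one with one (ids are 1-based; 0 = pad)
padded = pad_genre_ids([[2, 5, 7], [3]], maxlen=6)
# padded[0] -> [2, 5, 7, 0, 0, 0], padded[1] -> [3, 0, 0, 0, 0, 0]
```

With `combiner='mean'`, DeepCTR averages the embeddings of the non-padding positions in each row.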
deepfm_ctr_model_1, feature_names = create_deepfm_ctr_model(fixlen_feature_columns, varlen_feature_columns)
train_input = {name:X_train_map[name] for name in feature_names}
train_time_callback_2, callbacks = create_callbacks('deepfm_ctr_model')
deepfm_ctr_model_1_history = deepfm_ctr_model_1.fit(train_input, y_train, batch_size=32, epochs=25, verbose=1, validation_split=0.1, callbacks=callbacks )
Epoch 1/25
2250/2250 [==============================] - 11s 4ms/step - loss: 1.4657 - mae: 1.4656 - root_mean_squared_error: 1.6716 - val_loss: 0.8260 - val_mae: 0.8259 - val_root_mean_squared_error: 0.9961
Epoch 2/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7343 - mae: 0.7342 - root_mean_squared_error: 0.9411 - val_loss: 0.8191 - val_mae: 0.8190 - val_root_mean_squared_error: 0.9903
Epoch 3/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.7234 - mae: 0.7233 - root_mean_squared_error: 0.9293 - val_loss: 0.8077 - val_mae: 0.8076 - val_root_mean_squared_error: 0.9855
Epoch 4/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.7245 - mae: 0.7244 - root_mean_squared_error: 0.9330 - val_loss: 0.8098 - val_mae: 0.8097 - val_root_mean_squared_error: 0.9846
Epoch 5/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.7184 - mae: 0.7184 - root_mean_squared_error: 0.9285 - val_loss: 0.8066 - val_mae: 0.8065 - val_root_mean_squared_error: 0.9828
Epoch 6/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.7187 - mae: 0.7186 - root_mean_squared_error: 0.9301 - val_loss: 0.7953 - val_mae: 0.7952 - val_root_mean_squared_error: 0.9777
Epoch 7/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.7109 - mae: 0.7108 - root_mean_squared_error: 0.9218 - val_loss: 0.8005 - val_mae: 0.8004 - val_root_mean_squared_error: 0.9843
Epoch 8/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.7059 - mae: 0.7058 - root_mean_squared_error: 0.9194 - val_loss: 0.7858 - val_mae: 0.7857 - val_root_mean_squared_error: 0.9786
Epoch 9/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.7070 - mae: 0.7069 - root_mean_squared_error: 0.9226 - val_loss: 0.7936 - val_mae: 0.7934 - val_root_mean_squared_error: 0.9821
Epoch 10/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.7033 - mae: 0.7032 - root_mean_squared_error: 0.9193 - val_loss: 0.7864 - val_mae: 0.7862 - val_root_mean_squared_error: 0.9809
Epoch 11/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6951 - mae: 0.6949 - root_mean_squared_error: 0.9136 - val_loss: 0.7898 - val_mae: 0.7897 - val_root_mean_squared_error: 0.9847
Epoch 12/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6893 - mae: 0.6891 - root_mean_squared_error: 0.9115 - val_loss: 0.7850 - val_mae: 0.7848 - val_root_mean_squared_error: 0.9804
Epoch 13/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6880 - mae: 0.6878 - root_mean_squared_error: 0.9106 - val_loss: 0.7868 - val_mae: 0.7866 - val_root_mean_squared_error: 0.9833
Epoch 14/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6852 - mae: 0.6850 - root_mean_squared_error: 0.9132 - val_loss: 0.7890 - val_mae: 0.7887 - val_root_mean_squared_error: 0.9853
Epoch 15/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6806 - mae: 0.6804 - root_mean_squared_error: 0.9071 - val_loss: 0.7856 - val_mae: 0.7854 - val_root_mean_squared_error: 0.9848
Epoch 16/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6798 - mae: 0.6795 - root_mean_squared_error: 0.9081 - val_loss: 0.7861 - val_mae: 0.7858 - val_root_mean_squared_error: 0.9845
Epoch 17/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6766 - mae: 0.6763 - root_mean_squared_error: 0.9098 - val_loss: 0.7895 - val_mae: 0.7892 - val_root_mean_squared_error: 0.9862
Epoch 18/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6735 - mae: 0.6732 - root_mean_squared_error: 0.9047 - val_loss: 0.7860 - val_mae: 0.7858 - val_root_mean_squared_error: 0.9855
Epoch 19/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6682 - mae: 0.6679 - root_mean_squared_error: 0.9003 - val_loss: 0.7887 - val_mae: 0.7884 - val_root_mean_squared_error: 0.9847
Epoch 20/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6677 - mae: 0.6674 - root_mean_squared_error: 0.9005 - val_loss: 0.7834 - val_mae: 0.7831 - val_root_mean_squared_error: 0.9834
Epoch 21/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6660 - mae: 0.6656 - root_mean_squared_error: 0.8989 - val_loss: 0.7853 - val_mae: 0.7850 - val_root_mean_squared_error: 0.9852
Epoch 22/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6625 - mae: 0.6621 - root_mean_squared_error: 0.8974 - val_loss: 0.7889 - val_mae: 0.7885 - val_root_mean_squared_error: 0.9897
Epoch 23/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6622 - mae: 0.6618 - root_mean_squared_error: 0.8964 - val_loss: 0.7851 - val_mae: 0.7848 - val_root_mean_squared_error: 0.9856
Epoch 24/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6611 - mae: 0.6607 - root_mean_squared_error: 0.8972 - val_loss: 0.7880 - val_mae: 0.7876 - val_root_mean_squared_error: 0.9870
Epoch 25/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6600 - mae: 0.6596 - root_mean_squared_error: 0.8963 - val_loss: 0.7877 - val_mae: 0.7873 - val_root_mean_squared_error: 0.9894
Total training time: 224.4608154296875 sec
plot_model(deepfm_ctr_model_1, "deepfm_ctr.png", show_shapes=True)
plot_history(deepfm_ctr_model_1_history, plot_title='DeepCTR DeepFM (Gender, Age, Genres)')
test_input = {name:X_test_map[name] for name in feature_names}
loss_metrics = deepfm_ctr_model_1.evaluate(test_input, y_test)
_, deepfm_ctr_model_1_mae, deepfm_ctr_model_1_rmse = loss_metrics
print(f'MAE: {deepfm_ctr_model_1_mae}')
print(f'RMSE: {deepfm_ctr_model_1_rmse}')
625/625 [==============================] - 1s 2ms/step - loss: 0.7302 - mae: 0.7298 - root_mean_squared_error: 0.9550
MAE: 0.7297515869140625
RMSE: 0.9549513459205627
prediction = deepfm_ctr_model_1.predict(test_input, batch_size=20)
test_pred = ratings_test.assign(rating_pred=prediction)
test_pred
|  | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 4.128789 |
| 1 | 0 | 9 | 3 | 875693118 | 4.801704 |
| 2 | 0 | 11 | 5 | 878542960 | 4.861095 |
| 3 | 0 | 13 | 5 | 874965706 | 4.861135 |
| 4 | 0 | 16 | 3 | 875073198 | 3.233257 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 4.004219 |
| 19996 | 457 | 1100 | 4 | 886397931 | 4.071216 |
| 19997 | 458 | 933 | 3 | 879563639 | 3.239141 |
| 19998 | 459 | 9 | 3 | 882912371 | 4.124553 |
| 19999 | 461 | 681 | 5 | 886365231 | 3.927983 |
20000 rows × 5 columns
mprk_deepfm_ctr_1 = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='DeepCTR DeepFM (Gender, Age, Genres)')
We will try different combinations of user and movie features to see what works best.
Use the DeepCTR DeepFM model and ignore the users' age
fixlen_feature_columns = [SparseFeat("user_id", rating_user_count, embedding_dim=20),
SparseFeat("movie_id", rated_movie_count, embedding_dim=20),
SparseFeat("gender", 2, embedding_dim=20)]
varlen_feature_columns = [VarLenSparseFeat(SparseFeat('genres', vocabulary_size=num_generes, embedding_dim=20), maxlen=max_movie_genres, combiner='mean', weight_name=None)]  # Note: value 0 is reserved for padding the sequence input feature
deepfm_ctr_model_2, feature_names = create_deepfm_ctr_model(fixlen_feature_columns, varlen_feature_columns)
train_input = {name:X_train_map[name] for name in feature_names}
train_time_callback_3, callbacks = create_callbacks('deepfm_ctr_model')
deepfm_ctr_model_2_history = deepfm_ctr_model_2.fit(train_input, y_train, batch_size=32, epochs=25, verbose=1, validation_split=0.1, callbacks=callbacks )
Epoch 1/25
2250/2250 [==============================] - 10s 4ms/step - loss: 1.4665 - mae: 1.4665 - root_mean_squared_error: 1.6690 - val_loss: 0.8365 - val_mae: 0.8365 - val_root_mean_squared_error: 1.0023
Epoch 2/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7315 - mae: 0.7315 - root_mean_squared_error: 0.9371 - val_loss: 0.8381 - val_mae: 0.8381 - val_root_mean_squared_error: 1.0014
Epoch 3/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7228 - mae: 0.7228 - root_mean_squared_error: 0.9309 - val_loss: 0.8161 - val_mae: 0.8160 - val_root_mean_squared_error: 0.9862
Epoch 4/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7199 - mae: 0.7198 - root_mean_squared_error: 0.9288 - val_loss: 0.7976 - val_mae: 0.7976 - val_root_mean_squared_error: 0.9744
Epoch 5/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7183 - mae: 0.7182 - root_mean_squared_error: 0.9283 - val_loss: 0.7925 - val_mae: 0.7925 - val_root_mean_squared_error: 0.9731
Epoch 6/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7161 - mae: 0.7161 - root_mean_squared_error: 0.9262 - val_loss: 0.7845 - val_mae: 0.7844 - val_root_mean_squared_error: 0.9701
Epoch 7/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7103 - mae: 0.7102 - root_mean_squared_error: 0.9236 - val_loss: 0.7873 - val_mae: 0.7872 - val_root_mean_squared_error: 0.9739
Epoch 8/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7054 - mae: 0.7053 - root_mean_squared_error: 0.9180 - val_loss: 0.7881 - val_mae: 0.7880 - val_root_mean_squared_error: 0.9747
Epoch 9/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.7013 - mae: 0.7011 - root_mean_squared_error: 0.9138 - val_loss: 0.7862 - val_mae: 0.7861 - val_root_mean_squared_error: 0.9759
Epoch 10/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6960 - mae: 0.6959 - root_mean_squared_error: 0.9105 - val_loss: 0.7792 - val_mae: 0.7790 - val_root_mean_squared_error: 0.9731
Epoch 11/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6938 - mae: 0.6936 - root_mean_squared_error: 0.9141 - val_loss: 0.7771 - val_mae: 0.7769 - val_root_mean_squared_error: 0.9741
Epoch 12/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6907 - mae: 0.6905 - root_mean_squared_error: 0.9138 - val_loss: 0.7802 - val_mae: 0.7800 - val_root_mean_squared_error: 0.9762
Epoch 13/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6837 - mae: 0.6835 - root_mean_squared_error: 0.9076 - val_loss: 0.7860 - val_mae: 0.7858 - val_root_mean_squared_error: 0.9835
Epoch 14/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6812 - mae: 0.6810 - root_mean_squared_error: 0.9079 - val_loss: 0.7781 - val_mae: 0.7778 - val_root_mean_squared_error: 0.9796
Epoch 15/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6788 - mae: 0.6786 - root_mean_squared_error: 0.9085 - val_loss: 0.7880 - val_mae: 0.7878 - val_root_mean_squared_error: 0.9838
Epoch 16/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6780 - mae: 0.6778 - root_mean_squared_error: 0.9085 - val_loss: 0.7810 - val_mae: 0.7807 - val_root_mean_squared_error: 0.9799
Epoch 17/25
2250/2250 [==============================] - 8s 4ms/step - loss: 0.6732 - mae: 0.6729 - root_mean_squared_error: 0.9048 - val_loss: 0.7822 - val_mae: 0.7819 - val_root_mean_squared_error: 0.9821
Epoch 18/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6714 - mae: 0.6711 - root_mean_squared_error: 0.9047 - val_loss: 0.7840 - val_mae: 0.7837 - val_root_mean_squared_error: 0.9839
Epoch 19/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6639 - mae: 0.6636 - root_mean_squared_error: 0.8933 - val_loss: 0.7778 - val_mae: 0.7775 - val_root_mean_squared_error: 0.9819
Epoch 20/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6602 - mae: 0.6599 - root_mean_squared_error: 0.8941 - val_loss: 0.7793 - val_mae: 0.7790 - val_root_mean_squared_error: 0.9817
Epoch 21/25
2250/2250 [==============================] - 9s 4ms/step - loss: 0.6617 - mae: 0.6614 - root_mean_squared_error: 0.8995 - val_loss: 0.7788 - val_mae: 0.7784 - val_root_mean_squared_error: 0.9819
Total training time: 174.46713733673096 sec
plot_history(deepfm_ctr_model_2_history, plot_title='DeepCTR DeepFM (Gender, Genres)')
test_input = {name:X_test_map[name] for name in feature_names}
loss_metrics = deepfm_ctr_model_2.evaluate(test_input, y_test)
_, deepfm_ctr_model_2_mae, deepfm_ctr_model_2_rmse = loss_metrics
print(f'MAE: {deepfm_ctr_model_2_mae}')
print(f'RMSE: {deepfm_ctr_model_2_rmse}')
625/625 [==============================] - 1s 2ms/step - loss: 0.7321 - mae: 0.7318 - root_mean_squared_error: 0.9596 MAE: 0.7317923903465271 RMSE: 0.9596238136291504
prediction = deepfm_ctr_model_2.predict(test_input, batch_size=20)
test_pred = ratings_test.assign(rating_pred=prediction)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 4.238823 |
| 1 | 0 | 9 | 3 | 875693118 | 4.370886 |
| 2 | 0 | 11 | 5 | 878542960 | 4.997692 |
| 3 | 0 | 13 | 5 | 874965706 | 4.890097 |
| 4 | 0 | 16 | 3 | 875073198 | 3.281315 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 4.023106 |
| 19996 | 457 | 1100 | 4 | 886397931 | 4.053744 |
| 19997 | 458 | 933 | 3 | 879563639 | 3.347976 |
| 19998 | 459 | 9 | 3 | 882912371 | 3.686494 |
| 19999 | 461 | 681 | 5 | 886365231 | 3.819952 |
20000 rows × 5 columns
mprk_deepfm_ctr_2 = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='DeepCTR DeepFM (Gender, Genres)')
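`plot_model_precision_recall_at_k` is a notebook helper whose definition is outside this section. A minimal sketch of the underlying metric, assuming an item counts as relevant when its true rating is at least some threshold (3.5 here is an assumption) and as recommended when it lands in the user's top-k by predicted rating:

```python
import numpy as np
import pandas as pd

def precision_recall_at_k(df, k=5, threshold=3.5):
    """Average per-user precision/recall@k from predicted ratings.

    Expects columns: user_id, rating (true), rating_pred (predicted).
    Relevant   = true rating >= threshold.
    Recommended = among the user's top-k items by predicted rating.
    """
    precisions, recalls = [], []
    for _, group in df.groupby('user_id'):
        top_k = group.nlargest(k, 'rating_pred')
        n_rec_relevant = int((top_k['rating'] >= threshold).sum())
        n_relevant = int((group['rating'] >= threshold).sum())
        # Divide by the actual number recommended (users may have < k items).
        precisions.append(n_rec_relevant / len(top_k))
        recalls.append(n_rec_relevant / n_relevant if n_relevant else 0.0)
    return np.mean(precisions), np.mean(recalls)
```

The helper in the notebook presumably computes these values for each k in the list and plots them; the exact relevance threshold it uses is not shown here.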
Ignore movie categorical genre data
fixlen_feature_columns = [SparseFeat("user_id", rating_user_count, embedding_dim=20),
SparseFeat("movie_id", rated_movie_count, embedding_dim=20),
SparseFeat("gender", 2, embedding_dim=20),
SparseFeat("age", max_age, embedding_dim=20)]
varlen_feature_columns = []
deepfm_ctr_model_3, feature_names = create_deepfm_ctr_model(fixlen_feature_columns, varlen_feature_columns)
train_input = {name:X_train_map[name] for name in feature_names}
train_time_callback_4, callbacks = create_callbacks('deepfm_ctr_model')
deepfm_ctr_model_3_history = deepfm_ctr_model_3.fit(train_input, y_train, batch_size=32, epochs=25, verbose=1, validation_split=0.1, callbacks=callbacks )
Epoch 1/25 2250/2250 [==============================] - 10s 4ms/step - loss: 1.5035 - mae: 1.5035 - root_mean_squared_error: 1.7045 - val_loss: 0.8404 - val_mae: 0.8403 - val_root_mean_squared_error: 1.0099 Epoch 2/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7330 - mae: 0.7329 - root_mean_squared_error: 0.9386 - val_loss: 0.8322 - val_mae: 0.8321 - val_root_mean_squared_error: 1.0006 Epoch 3/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7224 - mae: 0.7224 - root_mean_squared_error: 0.9303 - val_loss: 0.8375 - val_mae: 0.8375 - val_root_mean_squared_error: 1.0054 Epoch 4/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7209 - mae: 0.7208 - root_mean_squared_error: 0.9286 - val_loss: 0.8244 - val_mae: 0.8243 - val_root_mean_squared_error: 0.9979 Epoch 5/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7142 - mae: 0.7141 - root_mean_squared_error: 0.9226 - val_loss: 0.8011 - val_mae: 0.8010 - val_root_mean_squared_error: 0.9811 Epoch 6/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7070 - mae: 0.7069 - root_mean_squared_error: 0.9174 - val_loss: 0.8007 - val_mae: 0.8006 - val_root_mean_squared_error: 0.9803 Epoch 7/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7039 - mae: 0.7038 - root_mean_squared_error: 0.9144 - val_loss: 0.7889 - val_mae: 0.7888 - val_root_mean_squared_error: 0.9752 Epoch 8/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6986 - mae: 0.6985 - root_mean_squared_error: 0.9157 - val_loss: 0.7910 - val_mae: 0.7909 - val_root_mean_squared_error: 0.9788 Epoch 9/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6892 - mae: 0.6890 - root_mean_squared_error: 0.9062 - val_loss: 0.8007 - val_mae: 0.8005 - val_root_mean_squared_error: 0.9867 Epoch 10/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6821 - mae: 0.6819 - root_mean_squared_error: 0.9034 - val_loss: 
0.7860 - val_mae: 0.7859 - val_root_mean_squared_error: 0.9798 Epoch 11/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6828 - mae: 0.6826 - root_mean_squared_error: 0.9069 - val_loss: 0.7894 - val_mae: 0.7892 - val_root_mean_squared_error: 0.9843 Epoch 12/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6752 - mae: 0.6750 - root_mean_squared_error: 0.9047 - val_loss: 0.7870 - val_mae: 0.7868 - val_root_mean_squared_error: 0.9835 Epoch 13/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6722 - mae: 0.6720 - root_mean_squared_error: 0.9054 - val_loss: 0.7831 - val_mae: 0.7828 - val_root_mean_squared_error: 0.9819 Epoch 14/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6697 - mae: 0.6695 - root_mean_squared_error: 0.9019 - val_loss: 0.7881 - val_mae: 0.7879 - val_root_mean_squared_error: 0.9873 Epoch 15/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6698 - mae: 0.6696 - root_mean_squared_error: 0.9014 - val_loss: 0.7899 - val_mae: 0.7896 - val_root_mean_squared_error: 0.9866 Epoch 16/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6627 - mae: 0.6625 - root_mean_squared_error: 0.8969 - val_loss: 0.7872 - val_mae: 0.7869 - val_root_mean_squared_error: 0.9871 Epoch 17/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6593 - mae: 0.6590 - root_mean_squared_error: 0.8936 - val_loss: 0.7865 - val_mae: 0.7863 - val_root_mean_squared_error: 0.9840 Epoch 18/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6557 - mae: 0.6554 - root_mean_squared_error: 0.8919 - val_loss: 0.7856 - val_mae: 0.7853 - val_root_mean_squared_error: 0.9871 Epoch 19/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6538 - mae: 0.6535 - root_mean_squared_error: 0.8908 - val_loss: 0.7957 - val_mae: 0.7954 - val_root_mean_squared_error: 0.9982 Epoch 20/25 2250/2250 [==============================] - 8s 4ms/step - 
loss: 0.6458 - mae: 0.6455 - root_mean_squared_error: 0.8840 - val_loss: 0.7949 - val_mae: 0.7945 - val_root_mean_squared_error: 0.9985 Epoch 21/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6426 - mae: 0.6423 - root_mean_squared_error: 0.8808 - val_loss: 0.7884 - val_mae: 0.7880 - val_root_mean_squared_error: 0.9932 Epoch 22/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6408 - mae: 0.6404 - root_mean_squared_error: 0.8776 - val_loss: 0.7872 - val_mae: 0.7868 - val_root_mean_squared_error: 0.9957 Epoch 23/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6352 - mae: 0.6348 - root_mean_squared_error: 0.8751 - val_loss: 0.7843 - val_mae: 0.7839 - val_root_mean_squared_error: 0.9912 Total training time: 191.489972114563 sec
plot_history(deepfm_ctr_model_3_history, plot_title='DeepCTR DeepFM (Gender, Age)')
test_input = {name:X_test_map[name] for name in feature_names}
loss_metrics = deepfm_ctr_model_3.evaluate(test_input, y_test)
_, deepfm_ctr_model_3_mae, deepfm_ctr_model_3_rmse = loss_metrics
print(f'MAE: {deepfm_ctr_model_3_mae}')
print(f'RMSE: {deepfm_ctr_model_3_rmse}')
625/625 [==============================] - 1s 2ms/step - loss: 0.7281 - mae: 0.7277 - root_mean_squared_error: 0.9552 MAE: 0.7276912927627563 RMSE: 0.9552035927772522
prediction = deepfm_ctr_model_3.predict(test_input, batch_size=20)
test_pred = ratings_test.assign(rating_pred=prediction)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 4.140345 |
| 1 | 0 | 9 | 3 | 875693118 | 4.304011 |
| 2 | 0 | 11 | 5 | 878542960 | 4.887478 |
| 3 | 0 | 13 | 5 | 874965706 | 4.629174 |
| 4 | 0 | 16 | 3 | 875073198 | 2.993731 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 4.032073 |
| 19996 | 457 | 1100 | 4 | 886397931 | 3.980045 |
| 19997 | 458 | 933 | 3 | 879563639 | 3.280157 |
| 19998 | 459 | 9 | 3 | 882912371 | 3.728464 |
| 19999 | 461 | 681 | 5 | 886365231 | 4.113528 |
20000 rows × 5 columns
mprk_deepfm_ctr_3 = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='DeepCTR DeepFM (Gender, Age)')
Use only movie genre data
fixlen_feature_columns = [SparseFeat("user_id", rating_user_count, embedding_dim=20),
SparseFeat("movie_id", rated_movie_count, embedding_dim=20)]
varlen_feature_columns = [VarLenSparseFeat(SparseFeat('genres', vocabulary_size=num_generes, embedding_dim=20), maxlen=max_movie_genres, combiner='mean', weight_name=None)]  # Note: the value 0 is reserved for padding in sequence input features
deepfm_ctr_model_4, feature_names = create_deepfm_ctr_model(fixlen_feature_columns, varlen_feature_columns)
train_input = {name:X_train_map[name] for name in feature_names}
train_time_callback_5, callbacks = create_callbacks('deepfm_ctr_model')
deepfm_ctr_model_4_history = deepfm_ctr_model_4.fit(train_input, y_train, batch_size=32, epochs=25, verbose=1, validation_split=0.1, callbacks=callbacks )
Epoch 1/25 2250/2250 [==============================] - 10s 4ms/step - loss: 1.5556 - mae: 1.5556 - root_mean_squared_error: 1.7565 - val_loss: 0.8374 - val_mae: 0.8373 - val_root_mean_squared_error: 1.0034 Epoch 2/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7335 - mae: 0.7335 - root_mean_squared_error: 0.9405 - val_loss: 0.8659 - val_mae: 0.8659 - val_root_mean_squared_error: 1.0281 Epoch 3/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7258 - mae: 0.7258 - root_mean_squared_error: 0.9341 - val_loss: 0.8260 - val_mae: 0.8259 - val_root_mean_squared_error: 0.9931 Epoch 4/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7211 - mae: 0.7211 - root_mean_squared_error: 0.9307 - val_loss: 0.8117 - val_mae: 0.8116 - val_root_mean_squared_error: 0.9848 Epoch 5/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7191 - mae: 0.7190 - root_mean_squared_error: 0.9293 - val_loss: 0.8133 - val_mae: 0.8132 - val_root_mean_squared_error: 0.9855 Epoch 6/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7151 - mae: 0.7150 - root_mean_squared_error: 0.9254 - val_loss: 0.8149 - val_mae: 0.8148 - val_root_mean_squared_error: 0.9905 Epoch 7/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.7099 - mae: 0.7098 - root_mean_squared_error: 0.9238 - val_loss: 0.8123 - val_mae: 0.8122 - val_root_mean_squared_error: 0.9889 Epoch 8/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.7037 - mae: 0.7036 - root_mean_squared_error: 0.9176 - val_loss: 0.7994 - val_mae: 0.7993 - val_root_mean_squared_error: 0.9779 Epoch 9/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6973 - mae: 0.6972 - root_mean_squared_error: 0.9137 - val_loss: 0.8028 - val_mae: 0.8026 - val_root_mean_squared_error: 0.9856 Epoch 10/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6904 - mae: 0.6903 - root_mean_squared_error: 0.9078 - val_loss: 
0.7883 - val_mae: 0.7882 - val_root_mean_squared_error: 0.9752 Epoch 11/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6822 - mae: 0.6820 - root_mean_squared_error: 0.9027 - val_loss: 0.7982 - val_mae: 0.7980 - val_root_mean_squared_error: 0.9828 Epoch 12/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6790 - mae: 0.6788 - root_mean_squared_error: 0.9064 - val_loss: 0.7950 - val_mae: 0.7948 - val_root_mean_squared_error: 0.9846 Epoch 13/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6748 - mae: 0.6746 - root_mean_squared_error: 0.9006 - val_loss: 0.7852 - val_mae: 0.7850 - val_root_mean_squared_error: 0.9800 Epoch 14/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6702 - mae: 0.6700 - root_mean_squared_error: 0.8974 - val_loss: 0.7841 - val_mae: 0.7839 - val_root_mean_squared_error: 0.9773 Epoch 15/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6673 - mae: 0.6670 - root_mean_squared_error: 0.8979 - val_loss: 0.7770 - val_mae: 0.7768 - val_root_mean_squared_error: 0.9738 Epoch 16/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6643 - mae: 0.6640 - root_mean_squared_error: 0.8991 - val_loss: 0.7846 - val_mae: 0.7843 - val_root_mean_squared_error: 0.9795 Epoch 17/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6616 - mae: 0.6614 - root_mean_squared_error: 0.8974 - val_loss: 0.7793 - val_mae: 0.7790 - val_root_mean_squared_error: 0.9766 Epoch 18/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6581 - mae: 0.6578 - root_mean_squared_error: 0.8965 - val_loss: 0.7774 - val_mae: 0.7771 - val_root_mean_squared_error: 0.9761 Epoch 19/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6545 - mae: 0.6542 - root_mean_squared_error: 0.8907 - val_loss: 0.7784 - val_mae: 0.7780 - val_root_mean_squared_error: 0.9801 Epoch 20/25 2250/2250 [==============================] - 8s 4ms/step - 
loss: 0.6522 - mae: 0.6519 - root_mean_squared_error: 0.8906 - val_loss: 0.7822 - val_mae: 0.7819 - val_root_mean_squared_error: 0.9788 Epoch 21/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6471 - mae: 0.6467 - root_mean_squared_error: 0.8847 - val_loss: 0.7795 - val_mae: 0.7792 - val_root_mean_squared_error: 0.9813 Epoch 22/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6436 - mae: 0.6432 - root_mean_squared_error: 0.8849 - val_loss: 0.7805 - val_mae: 0.7802 - val_root_mean_squared_error: 0.9796 Epoch 23/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6424 - mae: 0.6420 - root_mean_squared_error: 0.8838 - val_loss: 0.7757 - val_mae: 0.7753 - val_root_mean_squared_error: 0.9807 Epoch 24/25 2250/2250 [==============================] - 8s 4ms/step - loss: 0.6346 - mae: 0.6342 - root_mean_squared_error: 0.8751 - val_loss: 0.7765 - val_mae: 0.7761 - val_root_mean_squared_error: 0.9816 Epoch 25/25 2250/2250 [==============================] - 9s 4ms/step - loss: 0.6364 - mae: 0.6360 - root_mean_squared_error: 0.8805 - val_loss: 0.7712 - val_mae: 0.7707 - val_root_mean_squared_error: 0.9770 Total training time: 210.9014608860016 sec
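Because `VarLenSparseFeat` consumes fixed-length sequences with 0 reserved as the padding id, each movie's variable-length genre list has to be right-padded to `max_movie_genres` before it goes into `X_train_map['genres']`. A sketch of that padding step, assuming genre ids are 1-indexed so 0 never collides with a real genre (equivalent to `keras.preprocessing.sequence.pad_sequences(..., padding='post')`):

```python
import numpy as np

def pad_genre_sequences(genre_lists, maxlen):
    """Right-pad variable-length genre id lists with 0 (the padding id)."""
    out = np.zeros((len(genre_lists), maxlen), dtype='int64')
    for i, genres in enumerate(genre_lists):
        out[i, :len(genres)] = genres[:maxlen]  # truncate if longer than maxlen
    return out
```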
plot_history(deepfm_ctr_model_4_history, plot_title='DeepCTR DeepFM (Genres)')
test_input = {name:X_test_map[name] for name in feature_names}
loss_metrics = deepfm_ctr_model_4.evaluate(test_input, y_test)
_, deepfm_ctr_model_4_mae, deepfm_ctr_model_4_rmse = loss_metrics
print(f'MAE: {deepfm_ctr_model_4_mae}')
print(f'RMSE: {deepfm_ctr_model_4_rmse}')
625/625 [==============================] - 1s 2ms/step - loss: 0.7285 - mae: 0.7281 - root_mean_squared_error: 0.9581 MAE: 0.7281006574630737 RMSE: 0.9581118226051331
prediction = deepfm_ctr_model_4.predict(test_input, batch_size=20)
test_pred = ratings_test.assign(rating_pred=prediction)
test_pred
| | user_id | movie_id | rating | unix_timestamp | rating_pred |
|---|---|---|---|---|---|
| 0 | 0 | 5 | 5 | 887431973 | 5.061394 |
| 1 | 0 | 9 | 3 | 875693118 | 4.513389 |
| 2 | 0 | 11 | 5 | 878542960 | 4.991316 |
| 3 | 0 | 13 | 5 | 874965706 | 4.792376 |
| 4 | 0 | 16 | 3 | 875073198 | 3.545493 |
| ... | ... | ... | ... | ... | ... |
| 19995 | 457 | 647 | 4 | 886395899 | 4.004834 |
| 19996 | 457 | 1100 | 4 | 886397931 | 3.993174 |
| 19997 | 458 | 933 | 3 | 879563639 | 3.259114 |
| 19998 | 459 | 9 | 3 | 882912371 | 3.994706 |
| 19999 | 461 | 681 | 5 | 886365231 | 3.985984 |
20000 rows × 5 columns
mprk_deepfm_ctr_4 = plot_model_precision_recall_at_k(test_pred, k=[3, 5, 10], plot_title='DeepCTR DeepFM (Genres)')
DeepFM models comparison
plot_model_loss_comparison(losses={'Keras DeepFM (Gender, Age, Genres)': {'mae': deepfm_model_mae, 'rmse': deepfm_model_rmse},
'DeepCTR DeepFM (Gender, Age, Genres)': {'mae': deepfm_ctr_model_1_mae, 'rmse': deepfm_ctr_model_1_rmse},
'DeepCTR DeepFM (Gender, Genres)': {'mae': deepfm_ctr_model_2_mae, 'rmse': deepfm_ctr_model_2_rmse},
'DeepCTR DeepFM (Gender, Age)': {'mae': deepfm_ctr_model_3_mae, 'rmse': deepfm_ctr_model_3_rmse},
'DeepCTR DeepFM (Genres)': {'mae': deepfm_ctr_model_4_mae, 'rmse': deepfm_ctr_model_4_rmse},
},
plot_title='DeepFM models comparison', high=True)
plot_model_train_time_comparison(times={'Keras DeepFM (Gender, Age, Genres)': train_time_callback_1.train_time_end,
'DeepCTR DeepFM (Gender, Age, Genres)': train_time_callback_2.train_time_end,
'DeepCTR DeepFM (Gender, Genres)': train_time_callback_3.train_time_end,
'DeepCTR DeepFM (Gender, Age)': train_time_callback_4.train_time_end,
'DeepCTR DeepFM (Genres)': train_time_callback_5.train_time_end,}, plot_title='DeepFM models training time comparison')
Overall models comparison
plot_model_loss_comparison(losses={'Baseline All': {'mae': basemodel_loss["MAE"], 'rmse': basemodel_loss["RMSE"]},
'TuriCreate Pearson Correlation': {'mae': mae(test_data["rating"], y_pred_pearson), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_pearson))},
'TuriCreate Matrix Factorization': {'mae': mae(test_data["rating"], y_pred_mf), 'rmse': rmse(np.array(test_data["rating"]), np.array(y_pred_mf))},
'NCF Model V1': {'mae': loss_metrics_ncf_1[1], 'rmse': loss_metrics_ncf_1[2]},
'NCF Model V2': {'mae': loss_metrics_ncf_2[1], 'rmse': loss_metrics_ncf_2[2]},
'NCF Model V3': {'mae': loss_metrics_ncf_3[1], 'rmse': loss_metrics_ncf_3[2]},
'NCF Model V4': {'mae': loss_metrics_ncf_4[1], 'rmse': loss_metrics_ncf_4[2]},
'Keras DeepFM (Gender, Age, Genres)': {'mae': deepfm_model_mae, 'rmse': deepfm_model_rmse},
'DeepCTR DeepFM (Gender, Age, Genres)': {'mae': deepfm_ctr_model_1_mae, 'rmse': deepfm_ctr_model_1_rmse},
'DeepCTR DeepFM (Gender, Genres)': {'mae': deepfm_ctr_model_2_mae, 'rmse': deepfm_ctr_model_2_rmse},
},
plot_title='Recommender Models comparison', high=True)
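The comparison above relies on `mae` and `rmse` helper functions defined earlier in the notebook. A minimal NumPy sketch of what such helpers presumably compute (the exact definitions are not shown in this section):

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted ratings."""
    return np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred)))

def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted ratings."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))
```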
plot_model_precision_recall_at_k_comparison(data={
'Baseline All': mprk_baseline_all,
'TuriCreate Pearson Correlation': mprk_pearson,
'TuriCreate Matrix Factorization': mprk_mf,
'NCF Model V1': mprk_ncf_1,
'NCF Model V2': mprk_ncf_2,
'NCF Model V3': mprk_ncf_3,
'NCF Model V4': mprk_ncf_4,
'Keras DeepFM (Gender, Age, Genres)': mprk_deepfm_keras,
'DeepCTR DeepFM (Gender, Age, Genres)': mprk_deepfm_ctr_1,
'DeepCTR DeepFM (Gender, Genres)': mprk_deepfm_ctr_2,
},
plot_title='Recommender Models comparison', wide=True)
Overall, the more complex models show no substantial differences in MAE/RMSE loss or in Precision/Recall@K. Therefore, and given the consistency of its results over multiple runs, we would recommend the DeepCTR-based DeepFM model using the additional features of user gender and movie genres for predicting ratings on the MovieLens 100K dataset.